# FlowMo Experiment Protocol
This document is the single paper-facing record for the public FlowMo experiments. When artifacts are regenerated, overwrite the existing paths rather than introducing version-suffixed copies.
## Scope
The project has two formal comparison groups.
### A. Learned World Models
Purpose: compare image-input world-model architectures under the same simulator data, optimizer budget, rollout target, parameter scale, and planning interface.
| Directory | Report Name | Architecture | Comparison Purpose |
|---|---|---|---|
| `flowmo` | FlowMo | Shared image encoder; short object-motion state encoder; long strided ambient-drift context encoder; base transition plus zero-context residual | Proposed flow-momentum factorization. Tests whether separating endogenous motion from exogenous drift improves prediction and planning. |
| `leworldmodel` | LeWorldModel | JEPA-style image-latent predictor with action-conditioned residual transition | Tests whether simple current-image latent prediction is sufficient. |
| `planet` | RSSM | Recurrent state-space latent model with deterministic memory and stochastic latent state | Tests whether generic recurrent memory can absorb momentum and drift without a separate context factor. |
| `tdmpc2` | TD-MPC2 Dynamics | Compact action-conditioned latent dynamics with shared image encoder and rollout heads | Tests task-oriented latent dynamics under equal supervision. |
All learned methods receive clean top-down RGB boat images and action history. They do not receive flow labels, flow arrows, velocity vectors, trajectory overlays, or goal markers in the image.
### B. Traditional Non-WM Controllers
Purpose: compare downstream behavior against non-neural controllers that do not train a world model.
| Directory | Report Name | Input | Comparison Purpose |
|---|---|---|---|
| `pid_los_controller` | PID/LOS | Clean image pose estimate | Simple hand-designed waypoint tracking baseline. |
| `no_flow_los_controller` | No-Flow LOS Controller | Clean image pose estimate | Measures the cost of ignoring ambient current. |
| `current_estimator_los_controller` | Current-Estimator LOS Controller | Clean image pose estimate and recent drift | Strong classical current-compensation baseline. |
| `oracle_flow_los_controller` | Oracle-Flow LOS Controller | Clean image pose estimate and simulator local flow | True-local-flow feed-forward reference for a simple geometric controller, not a full dynamics-MPC upper bound. |
## Data
All methods use the same splits:
```text
train: data/paper/train.npz
test: data/paper/test.npz
dataset_card: data/paper/dataset_card.md
generation_config: data/paper/generation_config.json
```
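The array names stored inside the split files are not listed in this document. A minimal sketch for inspecting a split without assuming any key names follows; the helper `describe_npz` is hypothetical and not part of the repository.

```python
import os

import numpy as np


def describe_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz file."""
    # np.load on an .npz archive yields a lazy NpzFile; `files` lists its keys.
    with np.load(path) as data:
        return {name: (data[name].shape, str(data[name].dtype)) for name in data.files}


# Paths copied from the split listing; skipped when the data is not present.
for split in ("data/paper/train.npz", "data/paper/test.npz"):
    if os.path.exists(split):
        for name, (shape, dtype) in describe_npz(split).items():
            print(split, name, shape, dtype)
```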
The train split, test split, and final planning evaluation use the same paper
flow-family set: `noflow`, `uniform`, `vortex_center`, `double_gyre`,
`source_sink`, `source_sink_pair`, `gradient`, `shear`, `turbulent_patch`, and
`random_fourier`.
All paper flow fields are static. Localized structures are sampled near common
task routes and waypoint corridors so that non-uniform flow is encountered by
the boat during both training trajectories and final planning tasks.
Observation protocol:
```text
image_size: 160 x 160
visual_scale: 2.5
rendering: online clean top-down RGB images
forbidden cues: flow arrows, velocity vectors, trajectory overlays, goal markers
```
Training budget:
```text
train_episodes: 2400
test_episodes: 480
train_windows: 393216
test_windows: 24576
batch_size: 256
steps: 20000
checkpoint_interval: 2000
num_workers: 4
render_mode: device
training_parallel_jobs: 2
planning_parallel_jobs: 3
```
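The budget numbers can be cross-checked with a little arithmetic. The sketch below assumes that each optimizer step consumes one batch of distinct training windows and that a checkpoint is written at every multiple of `checkpoint_interval`, including the final step; neither assumption is stated in this document.

```python
# Values copied from the training-budget block.
TRAIN_WINDOWS = 393_216
TEST_WINDOWS = 24_576
BATCH_SIZE = 256
STEPS = 20_000
CHECKPOINT_INTERVAL = 2_000

samples_seen = BATCH_SIZE * STEPS                      # windows consumed over the run
train_batches_per_epoch = TRAIN_WINDOWS // BATCH_SIZE  # divides exactly
test_batches = TEST_WINDOWS // BATCH_SIZE              # divides exactly
epochs = samples_seen / TRAIN_WINDOWS                  # passes over the train windows

# Assumed checkpoint schedule: 2000, 4000, ..., 20000.
checkpoint_steps = list(range(CHECKPOINT_INTERVAL, STEPS + 1, CHECKPOINT_INTERVAL))

print(samples_seen, train_batches_per_epoch, test_batches)
print(round(epochs, 2), len(checkpoint_steps))
```

Under these assumptions the run covers roughly 13 passes over the training windows and produces 10 intermediate checkpoints.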
Precision policy:
```text
training: bf16 model autocast, fp32 losses and metrics
prediction_eval: bf16 model autocast, fp32 metrics
planning_eval: fp32
```
The precision split is intentional: BF16 speeds up image encoding and latent rollout on the RTX 5090 without measurable short-run loss drift, while CEM planning is dominated by small control tensors and did not improve under BF16.
## Prediction Evaluation
Dataset:
```text
test: data/paper/test.npz
```
Metrics:
```text
pos@1, pos@5, pos@10, pos@20, pos@40, pos@60
heading@20, heading@60
zero-action drift prediction error
no-flow momentum decay prediction error
same-action different-flow prediction error
```
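The exact metric definitions live in the evaluation code, not in this document. As a rough sketch, `pos@k` is assumed here to be the Euclidean position error at rollout step `k`, and `heading@k` the absolute heading error at step `k` wrapped to `[0, pi]`. The function names and the toy rollout are hypothetical.

```python
import math


def pos_error_at(pred_xy, true_xy, k):
    """Euclidean position error at rollout step k (1-indexed)."""
    px, py = pred_xy[k - 1]
    tx, ty = true_xy[k - 1]
    return math.hypot(px - tx, py - ty)


def heading_error_at(pred_theta, true_theta, k):
    """Absolute heading error in radians at step k, wrapped to [0, pi]."""
    d = pred_theta[k - 1] - true_theta[k - 1]
    return abs(math.atan2(math.sin(d), math.cos(d)))


# Toy rollout: the prediction drifts sideways by 0.01 per step.
true_xy = [(0.1 * t, 0.0) for t in range(1, 61)]
pred_xy = [(0.1 * t, 0.01 * t) for t in range(1, 61)]
errors = {k: pos_error_at(pred_xy, true_xy, k) for k in (1, 5, 10, 20, 40, 60)}
print(errors)
```

Wrapping the heading difference matters near the `-pi`/`pi` seam: a prediction of 3.1 rad against a target of -3.1 rad is about 0.08 rad off, not 6.2 rad. The three named diagnostics (zero-action drift, no-flow momentum decay, same-action different-flow) would presumably apply the same error to specific subsets of test windows.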
FlowMo-only context diagnostics:
| Diagnostic | Operation | Evidence Sought |
|---|---|---|
| Inferred context | Normal rollout with inferred `c_t` | Best prediction under flow. |
| Zero context | Set `c_t=0` | Degraded flow prediction and limited change in no-flow. |
| Shuffled context | Use context from another episode | Worse rollout when hidden flow differs. |
| Same-flow transfer | Use context from another episode with the same hidden flow | Better transfer than wrong-flow context. |
| Context norm | Compare no-flow and flow `||c_t||` | Flow context should be larger than no-flow context. |
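
The context interventions in the table can be sketched on plain per-episode context vectors. Everything here is illustrative: `flow_ids` is a hypothetical label identifying each episode's hidden flow, and the real diagnostics operate on the model's `c_t` tensors rather than Python lists.

```python
import math
import random


def zero_context(contexts):
    """Zero-context ablation: replace every context vector with zeros."""
    return [[0.0] * len(c) for c in contexts]


def donor_context(contexts, flow_ids, rng, same_flow):
    """Give each episode the context of a different episode.

    same_flow=True  -> donor shares the hidden flow (same-flow transfer)
    same_flow=False -> donor has a different hidden flow (wrong-flow shuffle)
    """
    swapped = []
    for i, fid in enumerate(flow_ids):
        donors = [j for j, other in enumerate(flow_ids)
                  if j != i and (other == fid) == same_flow]
        if not donors:
            raise ValueError(f"no eligible donor for episode {i}")
        swapped.append(contexts[rng.choice(donors)])
    return swapped


def context_norm(c):
    """L2 norm ||c_t|| of a single context vector."""
    return math.sqrt(sum(v * v for v in c))


contexts = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0], [0.5, 0.5]]
flow_ids = ["a", "a", "b", "b"]
rng = random.Random(0)
same = donor_context(contexts, flow_ids, rng, same_flow=True)
wrong = donor_context(contexts, flow_ids, rng, same_flow=False)
```

Separating the same-flow and wrong-flow donors keeps the two shuffled rows of the table distinguishable: a drop in accuracy that appears only with wrong-flow donors points at the context carrying flow information rather than episode identity.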
FlowMo latent probes:
```text
Train frozen linear probes from z_t, c_t, and [z_t,c_t].
Targets: object momentum (vx, vy, omega), local flow vector, episode drift vector.
Purpose: verify which latent carries object-motion information and which latent carries ambient-drift information.
```
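A frozen linear probe can be approximated by an ordinary least-squares fit on latents exported from the trained model. The sketch below uses synthetic arrays in place of real `z_t` and `c_t`; the dimensions, the held-out split, and the R² score are assumptions made for illustration, and the numbers it produces are not experimental results.

```python
import numpy as np


def _with_bias(features):
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_linear_probe(features, targets):
    """Least-squares linear map (with bias) from fixed features to targets."""
    weights, *_ = np.linalg.lstsq(_with_bias(features), targets, rcond=None)
    return weights


def probe_r2(weights, features, targets):
    """Per-target coefficient of determination on the given data."""
    resid = targets - _with_bias(features) @ weights
    ss_res = (resid ** 2).sum(axis=0)
    ss_tot = ((targets - targets.mean(axis=0)) ** 2).sum(axis=0)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-ins: momentum depends only on z, drift only on c.
rng = np.random.default_rng(0)
z = rng.normal(size=(512, 8))
c = rng.normal(size=(512, 4))
momentum = z @ rng.normal(size=(8, 3))   # (vx, vy, omega)
drift = c @ rng.normal(size=(4, 2))      # episode drift vector

train, held_out = slice(0, 384), slice(384, 512)
r2 = {}
for name, feats in {"z": z, "c": c, "zc": np.hstack([z, c])}.items():
    for target_name, target in {"momentum": momentum, "drift": drift}.items():
        w = fit_linear_probe(feats[train], target[train])
        r2[(name, target_name)] = probe_r2(w, feats[held_out], target[held_out])
```

The pattern to look for on real latents is the same as in this toy case: a high score from `z_t` on momentum, a high score from `c_t` on drift, and low scores on the crossed pairs.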
## Planning Evaluation
Learned WM planners:
```text
flowmo
leworldmodel
planet
tdmpc2
```
All learned world models use the same route-aware CEM planner over their latent rollouts.
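
For readers unfamiliar with CEM, the following is a generic cross-entropy-method sketch, not the repository's planner. The route-aware cost, population sizes, and action bounds are not specified in this document, so the values below are placeholders, and a toy 2-D point with constant drift stands in for a learned latent rollout.

```python
import numpy as np


def cem_plan(rollout_cost, horizon, action_dim, rng,
             iters=5, population=128, n_elite=16):
    """Generic CEM over open-loop action sequences; returns the final mean plan."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        noise = rng.standard_normal((population, horizon, action_dim))
        samples = np.clip(mean + std * noise, -1.0, 1.0)  # assumed action bounds
        costs = np.array([rollout_cost(a) for a in samples])
        elite = samples[np.argsort(costs)[:n_elite]]      # lowest-cost sequences
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6                    # keep sampling non-degenerate
    return mean


# Toy stand-in for a world-model rollout: a point pushed by a constant drift.
DRIFT = np.array([0.0, -0.05])
GOAL = np.array([0.5, 0.0])


def toy_cost(actions):
    pos = np.zeros(2)
    for a in actions:
        pos = pos + 0.1 * a + DRIFT
    return float(np.linalg.norm(pos - GOAL))


plan = cem_plan(toy_cost, horizon=10, action_dim=2, rng=np.random.default_rng(0))
print(toy_cost(np.zeros((10, 2))), toy_cost(plan))
```

Sharing one planner across all four models means that differences in planning metrics can be attributed to rollout quality rather than to the optimizer.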
Traditional non-WM controllers:
```text
pid_los_controller
no_flow_los_controller
current_estimator_los_controller
oracle_flow_los_controller
```
Tasks:
```text
reach_target
station_keeping
waypoint_square
waypoint_zigzag
```
Boats:
```text
twin
triangle
```
Flow families:
```text
noflow
uniform
vortex_center
double_gyre
source_sink
source_sink_pair
gradient
shear
turbulent_patch
random_fourier
```
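The size of the planning grid follows from the lists above. This sketch assumes every method is evaluated on the full task × boat × flow-family product; the number of episodes per cell is not given in this document and is left out.

```python
from itertools import product

# Names copied from the lists above.
LEARNED = ["flowmo", "leworldmodel", "planet", "tdmpc2"]
CONTROLLERS = [
    "pid_los_controller",
    "no_flow_los_controller",
    "current_estimator_los_controller",
    "oracle_flow_los_controller",
]
TASKS = ["reach_target", "station_keeping", "waypoint_square", "waypoint_zigzag"]
BOATS = ["twin", "triangle"]
FLOWS = [
    "noflow", "uniform", "vortex_center", "double_gyre", "source_sink",
    "source_sink_pair", "gradient", "shear", "turbulent_patch", "random_fourier",
]

# Assumption: the full Cartesian product is evaluated for every method.
conditions = list(product(TASKS, BOATS, FLOWS))
runs = [(method, *cond) for method in LEARNED + CONTROLLERS for cond in conditions]
print(len(conditions), len(runs))  # 80 640
```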
Metrics:
```text
success rate
final distance
trajectory length over successful episodes
control effort (sum_t ||a_t||_2^2) over successful episodes
time to goal over successful episodes
```
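A sketch of how these metrics could be aggregated from per-episode records is shown here. The record keys are hypothetical, and final distance is assumed to be averaged over all episodes, since only the other three metrics are described as restricted to successful ones.

```python
import math


def _mean(values):
    return sum(values) / len(values) if values else float("nan")


def control_effort(actions):
    """sum_t ||a_t||_2^2 for one episode."""
    return sum(sum(x * x for x in a) for a in actions)


def path_length(path):
    """Sum of Euclidean distances between consecutive (x, y) positions."""
    return sum(math.dist(p, q) for p, q in zip(path, path[1:]))


def summarize(episodes):
    """episodes: dicts with hypothetical keys success, final_distance,
    actions, path, steps_to_goal."""
    wins = [e for e in episodes if e["success"]]
    return {
        "success_rate": len(wins) / len(episodes),
        "final_distance": _mean([e["final_distance"] for e in episodes]),
        "trajectory_length": _mean([path_length(e["path"]) for e in wins]),
        "control_effort": _mean([control_effort(e["actions"]) for e in wins]),
        "time_to_goal": _mean([e["steps_to_goal"] for e in wins]),
    }


episodes = [
    {"success": True, "final_distance": 0.1, "actions": [[1.0, 0.0], [0.0, 1.0]],
     "path": [(0.0, 0.0), (3.0, 4.0)], "steps_to_goal": 2},
    {"success": False, "final_distance": 2.0, "actions": [[1.0, 1.0]],
     "path": [(0.0, 0.0), (1.0, 0.0)], "steps_to_goal": None},
]
summary = summarize(episodes)
```

Conditioning length, effort, and time on success avoids rewarding a method that fails quickly, but it also means those three numbers should be read alongside the success rate.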
## Required Outputs
Training outputs:
```text
experiments/<method>/checkpoint/paper.pt
experiments/<method>/checkpoint/paper_step_*.pt
experiments/<method>/result/parameter_count.json
experiments/<method>/result/paper_training.json
experiments/<method>/result/paper_training_trace.jsonl
```
Evaluation outputs:
```text
experiments/reports/paper_prediction.json
experiments/reports/paper_flowmo_latent_probes.json
experiments/reports/paper_planning/*.json
experiments/reports/paper_planning/gifs/*.gif
experiments/reports/paper_report.md
```
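A completed run can be checked against the output lists with a small script. This sketch assumes `<method>` ranges over the four learned world-model directories and treats every entry as a glob pattern relative to the repository root; the function name is hypothetical.

```python
from pathlib import Path

LEARNED = ["flowmo", "leworldmodel", "planet", "tdmpc2"]

# Patterns copied from the output lists, relative to the repository root.
TRAINING_OUTPUTS = [
    "checkpoint/paper.pt",
    "checkpoint/paper_step_*.pt",
    "result/parameter_count.json",
    "result/paper_training.json",
    "result/paper_training_trace.jsonl",
]
EVALUATION_OUTPUTS = [
    "experiments/reports/paper_prediction.json",
    "experiments/reports/paper_flowmo_latent_probes.json",
    "experiments/reports/paper_planning/*.json",
    "experiments/reports/paper_planning/gifs/*.gif",
    "experiments/reports/paper_report.md",
]


def missing_outputs(root="."):
    """Return every required pattern that matches no path under root."""
    root = Path(root)
    patterns = [f"experiments/{m}/{rel}" for m in LEARNED for rel in TRAINING_OUTPUTS]
    patterns += EVALUATION_OUTPUTS
    return [p for p in patterns if not any(root.glob(p))]


if __name__ == "__main__":
    for pattern in missing_outputs():
        print("missing:", pattern)
```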
## Commands
Run the complete paper pipeline:
```bash
python -m experiments.run_paper_image_pipeline
```
Run stages separately:
```bash
python -m experiments.run_paper_image_pipeline --stages train
python -m experiments.run_paper_image_pipeline --stages prediction
python -m experiments.run_paper_image_pipeline --stages probe
python -m experiments.run_paper_image_pipeline --stages planning
python -m experiments.run_paper_image_pipeline --stages report
```
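When the stages are scripted, stopping at the first failure keeps later stages from consuming stale artifacts. The wrapper below is a sketch: it assumes the stages should run in the order listed above and only reuses the `--stages` invocations shown there.

```python
import subprocess
import sys

# Stage names in the order they are listed above (assumed to be the run order).
STAGES = ["train", "prediction", "probe", "planning", "report"]


def stage_commands(stages=STAGES):
    """Build one command per stage, using the current Python interpreter."""
    return [
        [sys.executable, "-m", "experiments.run_paper_image_pipeline", "--stages", s]
        for s in stages
    ]


def run_stages(stages=STAGES, dry_run=True):
    """Print each command; execute it unless dry_run is set."""
    for cmd in stage_commands(stages):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # which stops the loop before later stages run.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_stages(dry_run=True)
```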
Run tests:
```bash
python -m pytest -q
```