	# FlowMo Paper Experiment Matrix

	This document defines the paper-facing experiments. File and run names carry no version suffixes; regenerated artifacts overwrite the same public paths.

	## Shared Data And Observation Protocol

	All learned world models use the same simulator data and clean-image observation pipeline.

	```text
	Image input: clean top-down RGB boat image
	Image size: 160 x 160
	Visual scale: 2.5
	Forbidden image cues: flow arrows, velocity vectors, trajectory overlays, goal marker
	Train split: data/paper/train.npz
	Test split: data/paper/test.npz
	Flow families: noflow, uniform, vortex_center, double_gyre, source_sink, source_sink_pair, gradient, shear, turbulent_patch, random_fourier
	Config: experiments/shared/config/paper_image.json
	Checkpoint: paper.pt
	Intermediate checkpoints: paper_step_XXXXXX.pt
	```

	All flow fields are static. Localized flow structures are sampled near the
	route corridors used by the training controllers and final planning tasks.

	Formal training budget:

	```text
	train_episodes: 2400
	test_episodes: 480
	train_windows: 393216
	test_windows: 24576
	batch_size: 256
	steps: 20000
	checkpoint_interval: 2000
	num_workers: 4
	render_mode: device
	training_parallel_jobs: 2
	planning_parallel_jobs: 3
	```
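
A few derived quantities follow from the budget above. The sketch below computes them from the listed values. It assumes that `XXXXXX` in the intermediate checkpoint name is the six-digit zero-padded step count and that a checkpoint is written at every multiple of `checkpoint_interval`, including the final step; neither detail is stated in this document.

```python
# Values copied from the formal training budget above.
train_windows = 393216
batch_size = 256
steps = 20000
checkpoint_interval = 2000

batches_per_epoch = train_windows // batch_size   # windows divide evenly into batches
samples_seen = steps * batch_size                 # total windows consumed by training
approx_epochs = samples_seen / train_windows      # passes over the training windows

# Assumption: XXXXXX is the zero-padded optimizer step.
checkpoint_steps = list(range(checkpoint_interval, steps + 1, checkpoint_interval))
checkpoint_names = [f"paper_step_{s:06d}.pt" for s in checkpoint_steps]

print(batches_per_epoch, samples_seen, round(approx_epochs, 2))
print(len(checkpoint_names), checkpoint_names[0], checkpoint_names[-1])
```

Under these assumptions the run covers roughly 13 passes over the training windows and produces 10 intermediate checkpoints.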

	Precision policy:

	```text
	training: bf16 model autocast, fp32 losses and metrics
	prediction_eval: bf16 model autocast, fp32 metrics
	planning_eval: fp32
	```
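
One way to realize the training line of this policy in PyTorch is sketched below: the forward pass runs under bf16 autocast and the prediction is cast back to fp32 before the loss. The model, tensors, and CPU device are placeholders chosen so the snippet runs anywhere; the repository's actual training loop may differ.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 2)        # placeholder for a world model
x = torch.randn(8, 4)          # placeholder inputs
target = torch.randn(8, 2)     # placeholder targets

# Model forward under bf16 autocast (CPU here; a GPU run would use device_type="cuda").
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    pred = model(x)

# Loss computed in fp32, outside the autocast region.
loss = nn.functional.mse_loss(pred.float(), target)
loss.backward()

print(pred.dtype, loss.dtype, model.weight.grad.dtype)
```

Parameters and their gradients stay in fp32; only the activations inside the autocast region are bf16.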

	## A. Learned World-Model Comparison

	Purpose: measure world-model quality directly. The key question is whether FlowMo's short object-motion state plus long ambient-drift context improves rollout prediction under hidden currents and momentum.

	| Method | Comparison Role | What It Tests |
	|---|---|---|
	| `flowmo` | Proposed WM | Explicit flow-momentum factorization: short state/momentum latent, long drift context, zero-context residual transition. |
	| `leworldmodel` | JEPA-style WM baseline | Whether simple image-latent prediction without explicit history/context can handle boat momentum and flow. |
	| `planet` | RSSM WM baseline | Whether generic recurrent latent memory can represent momentum and drift without a separate context factor. |
	| `tdmpc2` | Compact latent-dynamics WM baseline | Whether a compact action-conditioned latent transition matches FlowMo under equal supervision. |

	Prediction dataset:

	```text
	test
	```

	Prediction metrics:

	```text
	pos@1, pos@5, pos@10, pos@20, pos@40, pos@60
	heading@20, heading@60
	zero-action drift error
	no-flow momentum decay error
	same-action different-flow error
	```
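
The horizon metrics can be computed as in the sketch below. It assumes `pos@k` is the mean Euclidean position error at rollout step `k` and `heading@k` is the mean absolute heading error in radians after angle wrapping; the function names and array layouts are illustrative, not taken from the repository.

```python
import numpy as np

def position_error_at(pred_xy, true_xy, horizons):
    """Mean Euclidean position error at each 1-indexed rollout horizon.

    pred_xy, true_xy: arrays of shape (num_windows, rollout_len, 2).
    """
    diff = pred_xy.astype(np.float32) - true_xy.astype(np.float32)
    dist = np.linalg.norm(diff, axis=-1)
    return {f"pos@{k}": float(dist[:, k - 1].mean()) for k in horizons}

def heading_error_at(pred_theta, true_theta, horizons):
    """Mean absolute heading error in radians, wrapped to [0, pi].

    pred_theta, true_theta: arrays of shape (num_windows, rollout_len).
    """
    diff = pred_theta - true_theta
    wrapped = np.arctan2(np.sin(diff), np.cos(diff))
    return {f"heading@{k}": float(np.abs(wrapped[:, k - 1]).mean()) for k in horizons}

# Toy check: a constant (3, 4) offset gives a position error of 5 at every horizon.
true_xy = np.zeros((4, 60, 2))
pred_xy = true_xy + np.array([3.0, 4.0])
pos = position_error_at(pred_xy, true_xy, [1, 5, 10, 20, 40, 60])

# Headings on either side of the 0 / 2*pi seam differ by 0.2 rad, not 2*pi - 0.2.
true_theta = np.full((4, 60), 0.1)
pred_theta = np.full((4, 60), 2 * np.pi - 0.1)
head = heading_error_at(pred_theta, true_theta, [20, 60])
print(pos, head)
```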

	FlowMo context diagnostics:

	| Diagnostic | Operation | Evidence Sought |
	|---|---|---|
	| Inferred context | Normal rollout with inferred `c_t` | Best prediction under flow. |
	| Zero context | Set `c_t=0` | Degraded flow prediction, smaller change in no-flow. |
	| Shuffled context | Use context from another episode | Worse rollout when hidden flow differs. |
	| Same-flow transfer | Use context from another episode with the same hidden flow | Better than wrong-flow context transfer. |
	| No-flow context norm | Measure `\|\|c_t\|\|` on no-flow data | Smaller than flow context norm. |
	| Context PCA | Plot `c_t` by flow family / flow id | Flow-related organization. |
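
The context interventions in the table can be sketched as array operations on a per-episode context matrix. The helper names and the `(num_episodes, context_dim)` layout are assumptions for illustration; the cyclic shift is one simple way to guarantee that no episode keeps its own context.

```python
import numpy as np

def zero_context(context):
    """Zero context: replace every c_t with the zero vector."""
    return np.zeros_like(context)

def shuffled_context(context):
    """Shuffled context: each episode receives the next episode's context."""
    return np.roll(context, shift=-1, axis=0)

def same_flow_transfer(context, flow_ids):
    """Same-flow transfer: swap contexts only among episodes sharing a flow id.

    Episodes whose flow id is unique keep their own context.
    """
    flow_ids = np.asarray(flow_ids)
    out = context.copy()
    for fid in np.unique(flow_ids):
        idx = np.flatnonzero(flow_ids == fid)
        if len(idx) > 1:
            out[idx] = context[np.roll(idx, -1)]
    return out

def context_norm(context):
    """Per-episode L2 norm, for comparing no-flow and flow episodes."""
    return np.linalg.norm(context, axis=-1)

context = np.arange(8, dtype=np.float32).reshape(4, 2)   # toy contexts
flow_ids = [0, 1, 0, 1]                                    # toy flow ids
transferred = same_flow_transfer(context, flow_ids)
print(shuffled_context(context), transferred, context_norm(context))
```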

	FlowMo latent probes:

	| Probe Target | Feature Sets | Purpose |
	|---|---|---|
	| Object momentum `(vx, vy, omega)` | `z_t`, `c_t`, `[z_t,c_t]` | Tests whether short-history state contains object motion. |
	| Local flow vector | `z_t`, `c_t`, `[z_t,c_t]` | Tests whether state plus context exposes local ambient drift. |
	| Episode drift vector | `z_t`, `c_t`, `[z_t,c_t]` | Tests whether long context contains environment-level drift. |
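
A probe of this kind can be sketched as a regression from each feature set to the target, scored on held-out data. The example assumes linear ridge probes and held-out R² as the score, which this document does not specify, and uses synthetic stand-ins for `z_t`, `c_t`, and the momentum target.

```python
import numpy as np

def fit_linear_probe(x_train, y_train, ridge=1e-3):
    """Closed-form ridge regression with a bias column (bias is also regularized)."""
    xb = np.hstack([x_train, np.ones((len(x_train), 1))])
    gram = xb.T @ xb + ridge * np.eye(xb.shape[1])
    return np.linalg.solve(gram, xb.T @ y_train)

def probe_r2(weights, x_test, y_test):
    """Coefficient of determination pooled over all target dimensions."""
    xb = np.hstack([x_test, np.ones((len(x_test), 1))])
    resid = y_test - xb @ weights
    total = y_test - y_test.mean(axis=0, keepdims=True)
    return 1.0 - float((resid ** 2).sum() / (total ** 2).sum())

# Synthetic stand-ins: the target depends on z only, so c alone should probe poorly.
rng = np.random.default_rng(0)
z = rng.normal(size=(700, 8))
c = rng.normal(size=(700, 4))
momentum = z @ rng.normal(size=(8, 3)) + 0.05 * rng.normal(size=(700, 3))

feature_sets = {"z_t": z, "c_t": c, "[z_t,c_t]": np.hstack([z, c])}
scores = {}
for name, feats in feature_sets.items():
    w = fit_linear_probe(feats[:500], momentum[:500])          # first 500 rows train
    scores[name] = probe_r2(w, feats[500:], momentum[500:])    # remaining 200 rows test
print(scores)
```

Comparing the three scores per target shows which factor carries the information: a high score from `z_t` alone and a low score from `c_t` alone would indicate that the target is encoded in the short-history state.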

	## B. Traditional Non-WM Control Comparison

	Purpose: provide downstream control references and report practical task behavior. The central WM claim still comes from A; B shows whether prediction differences matter for planning and control.

	Learned WM planners:

	```text
	flowmo
	leworldmodel
	planet
	tdmpc2
	```

	Traditional non-WM controllers:

	| Method | Comparison Role | What It Tests |
	|---|---|---|
	| `pid_los_controller` | Simple classical controller | Baseline waypoint tracking without learned dynamics. |
	| `no_flow_los_controller` | No-flow LOS controller | Effect of ignoring hidden current in a geometric controller. |
	| `current_estimator_los_controller` | Current-estimator LOS controller | Strength of a hand-designed drift estimator in a geometric controller. |
	| `oracle_flow_los_controller` | Oracle-flow LOS controller | Effect of true local flow feed-forward in a geometric controller. |

	Planning tasks:

	```text
	reach_target
	station_keeping
	waypoint_square
	waypoint_zigzag
	```

	Boats:

	```text
	twin
	triangle
	```

	Flow families:

	```text
	noflow
	uniform
	vortex_center
	double_gyre
	source_sink
	source_sink_pair
	gradient
	shear
	turbulent_patch
	random_fourier
	```
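
If every method is evaluated on every task, boat, and flow family (an assumption; this document lists the axes but does not state that the full Cartesian product is run), the planning grid can be enumerated as follows.

```python
from itertools import product

methods = [
    "flowmo", "leworldmodel", "planet", "tdmpc2",
    "pid_los_controller", "no_flow_los_controller",
    "current_estimator_los_controller", "oracle_flow_los_controller",
]
tasks = ["reach_target", "station_keeping", "waypoint_square", "waypoint_zigzag"]
boats = ["twin", "triangle"]
flows = [
    "noflow", "uniform", "vortex_center", "double_gyre", "source_sink",
    "source_sink_pair", "gradient", "shear", "turbulent_patch", "random_fourier",
]

# One cell per (method, task, boat, flow family) combination.
grid = list(product(methods, tasks, boats, flows))
cells_per_method = len(tasks) * len(boats) * len(flows)
print(len(grid), cells_per_method)
```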

	Planning metrics:

	```text
	success rate
	final distance
	trajectory length over successful episodes
	control effort (`sum_t ||a_t||_2^2`) over successful episodes
	time to goal over successful episodes
	```
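
The aggregation implied by these metrics can be sketched as below. The episode record layout and the function name are hypothetical, and time to goal is assumed to be the number of executed actions in an episode that ends on success; the repository may define these differently.

```python
import numpy as np

def summarize_planning(episodes):
    """Aggregate planning metrics over a list of episode records.

    Each record is a dict with keys: 'success' (bool), 'final_distance' (float),
    'positions' (T+1 x 2 sequence), 'actions' (T x action_dim sequence).
    """
    successes = [ep for ep in episodes if ep["success"]]

    def mean_over_successes(fn):
        # NaN when no episode succeeded, so the gap is visible in reports.
        return float(np.mean([fn(ep) for ep in successes])) if successes else float("nan")

    def path_length(ep):
        steps = np.diff(np.asarray(ep["positions"], dtype=np.float64), axis=0)
        return np.linalg.norm(steps, axis=1).sum()

    def control_effort(ep):
        actions = np.asarray(ep["actions"], dtype=np.float64)
        return (actions ** 2).sum()   # sum_t ||a_t||_2^2

    return {
        "success_rate": len(successes) / len(episodes),
        "final_distance": float(np.mean([ep["final_distance"] for ep in episodes])),
        "trajectory_length": mean_over_successes(path_length),
        "control_effort": mean_over_successes(control_effort),
        "time_to_goal": mean_over_successes(lambda ep: len(ep["actions"])),
    }

episodes = [
    {"success": True, "final_distance": 0.1,
     "positions": [[0, 0], [3, 4], [6, 8]], "actions": [[1, 0], [0, 2]]},
    {"success": False, "final_distance": 2.0,
     "positions": [[0, 0], [1, 0]], "actions": [[1, 1]]},
]
summary = summarize_planning(episodes)
print(summary)
```

Restricting trajectory length, control effort, and time to goal to successful episodes keeps a method from looking efficient simply because it stops early without reaching the goal.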

	Formal commands:

	```bash
	python -m experiments.run_paper_image_pipeline --stages train
	python -m experiments.run_paper_image_pipeline --stages prediction
	python -m experiments.run_paper_image_pipeline --stages probe
	python -m experiments.run_paper_image_pipeline --stages planning
	python -m experiments.run_paper_image_pipeline --stages report
	```
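
To run the stages back to back and stop at the first failure, a small driver such as the following could be used. It only wraps the five commands listed above, in the listed order; the driver itself is a hypothetical convenience, not part of the repository.

```python
import subprocess
import sys

# Stage names and order taken from the formal commands above.
STAGES = ["train", "prediction", "probe", "planning", "report"]

def stage_command(stage):
    """Build the command for one stage, using the current Python interpreter."""
    return [sys.executable, "-m", "experiments.run_paper_image_pipeline", "--stages", stage]

def run_all(dry_run=True):
    for stage in STAGES:
        cmd = stage_command(stage)
        if dry_run:
            print(" ".join(cmd))              # show what would run
        else:
            subprocess.run(cmd, check=True)   # raises if the stage exits non-zero

if __name__ == "__main__":
    run_all(dry_run=True)
```

Call `run_all(dry_run=False)` from the repository root to execute the stages.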