	# FlowMo Experiment Protocol

	This document is the single paper-facing record for the public FlowMo experiments. Regenerated artifacts should replace the same paths instead of introducing version suffixes.

	## Scope

	The project has two formal comparison groups.

	### A. Learned World Models

	Purpose: compare image-input world-model architectures under the same simulator data, optimizer budget, rollout target, parameter scale, and planning interface.

	| Directory | Report Name | Architecture | Comparison Purpose |
	|---|---|---|---|
	| `flowmo` | FlowMo | Shared image encoder; short object-motion state encoder; long strided ambient-drift context encoder; base transition plus zero-context residual | Proposed flow-momentum factorization. Tests whether separating endogenous motion from exogenous drift improves prediction and planning. |
	| `leworldmodel` | LeWorldModel | JEPA-style image-latent predictor with action-conditioned residual transition | Tests whether simple current-image latent prediction is sufficient. |
	| `planet` | RSSM | Recurrent state-space latent model with deterministic memory and stochastic latent state | Tests whether generic recurrent memory can absorb momentum and drift without a separate context factor. |
	| `tdmpc2` | TD-MPC2 Dynamics | Compact action-conditioned latent dynamics with shared image encoder and rollout heads | Tests task-oriented latent dynamics under equal supervision. |

	All learned methods receive clean top-down RGB boat images and action history. They do not receive flow labels, flow arrows, velocity vectors, trajectory overlays, or goal markers in the image.

	### B. Traditional Non-WM Controllers

	Purpose: compare downstream behavior against non-neural controllers that do not train a world model.

	| Directory | Report Name | Input | Comparison Purpose |
	|---|---|---|---|
	| `pid_los_controller` | PID/LOS | Clean image pose estimate | Simple hand-designed waypoint tracking baseline. |
	| `no_flow_los_controller` | No-Flow LOS Controller | Clean image pose estimate | Measures the cost of ignoring ambient current. |
	| `current_estimator_los_controller` | Current-Estimator LOS Controller | Clean image pose estimate and recent drift | Strong classical current-compensation baseline. |
	| `oracle_flow_los_controller` | Oracle-Flow LOS Controller | Clean image pose estimate and simulator local flow | True-local-flow feed-forward reference for a simple geometric controller, not a full dynamics-MPC upper bound. |
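
The gap between the No-Flow baseline and the flow-compensated baselines can be illustrated with a minimal line-of-sight (LOS) sketch. This is not the repository's controller: the function names, the constant through-water `speed`, and the crab-angle formula are assumptions chosen for illustration.

```python
import math

def los_heading(pos, waypoint):
    """Heading that points straight at the waypoint (ignores any current)."""
    return math.atan2(waypoint[1] - pos[1], waypoint[0] - pos[0])

def compensated_heading(pos, waypoint, current, speed):
    """LOS heading plus a crab angle that cancels the cross-track current.

    `current` is an assumed flow estimate in world coordinates and `speed`
    is an assumed constant speed through the water.
    """
    course = los_heading(pos, waypoint)
    # Component of the current perpendicular to the desired course.
    cross = -current[0] * math.sin(course) + current[1] * math.cos(course)
    # Clamp: a cross current stronger than the boat speed cannot be cancelled.
    ratio = max(-1.0, min(1.0, -cross / speed))
    return course + math.asin(ratio)
```

With a waypoint due east and a northward current of half the boat speed, `compensated_heading` steers 30 degrees south of the LOS heading so that the resulting ground track stays on the line to the waypoint.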

	## Data

	All methods use the same splits:

	```text
	train: data/paper/train.npz
	test: data/paper/test.npz
	dataset_card: data/paper/dataset_card.md
	generation_config: data/paper/generation_config.json
	```

	The train split, test split, and final planning evaluation use the same paper
	flow-family set: `noflow`, `uniform`, `vortex_center`, `double_gyre`,
	`source_sink`, `source_sink_pair`, `gradient`, `shear`, `turbulent_patch`, and
	`random_fourier`.

	All paper flow fields are static. Localized structures are sampled near common
	task routes and waypoint corridors so that non-uniform flow is encountered by
	the boat during both training trajectories and final planning tasks.
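
For readers unfamiliar with the flow-family names, the sketch below shows what a static field looks like for three of them. The parameterizations, default values, and function names are hypothetical; the actual generators are defined by `data/paper/generation_config.json` and may differ.

```python
def uniform_flow(x, y, u=0.3, v=0.0):
    """Same velocity everywhere."""
    return (u, v)

def shear_flow(x, y, rate=0.1):
    """x-velocity grows linearly with y; no y-velocity."""
    return (rate * y, 0.0)

def vortex_flow(x, y, cx=0.0, cy=0.0, strength=1.0, core=1.0):
    """Counter-clockwise swirl around (cx, cy), smoothed near the center."""
    dx, dy = x - cx, y - cy
    scale = strength / (dx * dx + dy * dy + core * core)
    return (-dy * scale, dx * scale)
```

None of the functions takes a time argument, which is what "static" means here: the velocity at a point depends only on position.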

	Observation protocol:

	```text
	image_size: 160 x 160
	visual_scale: 2.5
	rendering: online clean top-down RGB images
	forbidden cues: flow arrows, velocity vectors, trajectory overlays, goal markers
	```

	Training budget:

	```text
	train_episodes: 2400
	test_episodes: 480
	train_windows: 393216
	test_windows: 24576
	batch_size: 256
	steps: 20000
	checkpoint_interval: 2000
	num_workers: 4
	render_mode: device
	training_parallel_jobs: 2
	planning_parallel_jobs: 3
	```
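
As a sanity check on the budget, the arithmetic below derives how many windows a run consumes. It assumes that each step draws one batch of `batch_size` windows and that a checkpoint is written at every multiple of `checkpoint_interval`, including the final step; neither assumption is stated in the protocol.

```python
train_windows = 393_216
batch_size = 256
steps = 20_000
checkpoint_interval = 2_000

samples_seen = steps * batch_size
epochs = samples_seen / train_windows
checkpoint_steps = list(range(checkpoint_interval, steps + 1, checkpoint_interval))

print(f"windows consumed: {samples_seen:,}")
print(f"approx. passes over train windows: {epochs:.2f}")
print(f"step checkpoints: {len(checkpoint_steps)}")
```

Under these assumptions a run consumes 5,120,000 windows, which is about 13 passes over the training windows, and produces 10 step checkpoints.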

	Precision policy:

	```text
	training: bf16 model autocast, fp32 losses and metrics
	prediction_eval: bf16 model autocast, fp32 metrics
	planning_eval: fp32
	```

	The precision split is intentional: BF16 speeds up image encoding and latent rollout on the RTX 5090 without measurable short-run loss drift, while CEM planning is dominated by small control tensors and did not improve under BF16.
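
A minimal PyTorch sketch of the training-time policy is shown below, assuming the models are PyTorch modules. The `nn.Linear` layer is only a stand-in for a world model, and the tensor shapes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(8, 2).to(device)  # stand-in for a world model
x = torch.randn(4, 8, device=device)
target = torch.randn(4, 2, device=device)

# Forward pass under BF16 autocast; the parameters themselves stay in FP32.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    pred = model(x)

# Cast back to FP32 before computing the loss or any metric.
loss = nn.functional.mse_loss(pred.float(), target)
loss.backward()
```

For planning evaluation the same forward pass would simply run outside the `torch.autocast` context, which keeps everything in FP32.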

	## Prediction Evaluation

	Dataset:

	```text
	test
	```

	Metrics:

	```text
	pos@1, pos@5, pos@10, pos@20, pos@40, pos@60
	heading@20, heading@60
	zero-action drift prediction error
	no-flow momentum decay prediction error
	same-action different-flow prediction error
	```
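
The protocol does not spell out the metric definitions. One plausible reading, sketched here as an assumption, is that `pos@k` is the Euclidean position error at rollout step `k` and `heading@k` is the absolute wrapped heading error at step `k`. The function names are hypothetical.

```python
import math

def wrap_angle(a):
    """Map an angle in radians to the interval [-pi, pi)."""
    return (a + math.pi) % (2 * math.pi) - math.pi

def pos_error_at(pred_xy, true_xy, k):
    """Euclidean position error at rollout step k (1-indexed)."""
    (px, py), (tx, ty) = pred_xy[k - 1], true_xy[k - 1]
    return math.hypot(px - tx, py - ty)

def heading_error_at(pred_psi, true_psi, k):
    """Absolute heading error at rollout step k, wrapped across +/- pi."""
    return abs(wrap_angle(pred_psi[k - 1] - true_psi[k - 1]))
```

Wrapping matters for headings near the branch cut: a prediction of 3.1 rad against a true heading of -3.1 rad is an error of about 0.08 rad, not 6.2 rad.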

	FlowMo-only context diagnostics:

	| Diagnostic | Operation | Evidence Sought |
	|---|---|---|
	| Inferred context | Normal rollout with inferred `c_t` | Best prediction under flow. |
	| Zero context | Set `c_t=0` | Degraded flow prediction and limited change in no-flow. |
	| Shuffled context | Use context from another episode | Worse rollout when hidden flow differs. |
	| Same-flow transfer | Use context from another episode with the same hidden flow | Better transfer than wrong-flow context. |
	| Context norm | Compare no-flow and flow `\|\|c_t\|\|` | Flow context should be larger than no-flow context. |
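
The context-swapping diagnostics need a rule for choosing the donor episode. The sketch below shows one deterministic way to do it, assuming each test episode carries a hidden flow identifier. `zero_context`, `pick_donor`, and `flow_ids` are hypothetical names, not the repository's API.

```python
def zero_context(c):
    """Zero-context diagnostic: replace c_t with a zero vector of equal size."""
    return [0.0] * len(c)

def pick_donor(episode, flow_ids, same_flow):
    """Index of another episode whose hidden flow matches (or differs).

    Returns the first eligible episode, or None if there is none.
    """
    for j, flow in enumerate(flow_ids):
        if j != episode and (flow == flow_ids[episode]) == same_flow:
            return j
    return None

flow_ids = ["uniform", "shear", "uniform", "noflow"]
same_flow_donor = pick_donor(0, flow_ids, same_flow=True)    # same-flow transfer
wrong_flow_donor = pick_donor(0, flow_ids, same_flow=False)  # mismatched context
```

Episodes with no eligible donor (here the lone `noflow` episode in the same-flow case) would have to be skipped for that diagnostic.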

	FlowMo latent probes:

	```text
	Train frozen linear probes from z_t, c_t, and [z_t,c_t].
	Targets: object momentum (vx, vy, omega), local flow vector, episode drift vector.
	Purpose: verify which latent carries object-motion information and which latent carries ambient-drift information.
	```
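
A frozen linear probe can be sketched as an ordinary least-squares fit on fixed features. The example uses synthetic stand-ins for `z_t`, `c_t`, and the drift target, so the numbers say nothing about FlowMo itself; the helper names and the held-out R² score are assumptions.

```python
import numpy as np

def fit_linear_probe(feats, targets):
    """Least-squares linear map (with bias) from frozen features to targets."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    W, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return W

def probe_r2(W, feats, targets):
    """Coefficient of determination, pooled over all target dimensions."""
    X = np.hstack([feats, np.ones((len(feats), 1))])
    resid = targets - X @ W
    total = targets - targets.mean(axis=0)
    return 1.0 - (resid ** 2).sum() / (total ** 2).sum()

# Synthetic stand-ins: drift is a linear function of c_t and independent of z_t.
rng = np.random.default_rng(0)
z = rng.standard_normal((500, 4))
c = rng.standard_normal((500, 3))
drift = c @ rng.standard_normal((3, 2))

train, test = slice(0, 400), slice(400, 500)
scores = {}
for name, feats in {"z_t": z, "c_t": c, "[z_t,c_t]": np.hstack([z, c])}.items():
    W = fit_linear_probe(feats[train], drift[train])
    scores[name] = probe_r2(W, feats[test], drift[test])
```

In this toy setup the probe on `c_t` recovers the drift almost exactly while the probe on `z_t` does not, which is the pattern the real probes would need to show for the factorization claim to hold.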

	## Planning Evaluation

	Learned WM planners:

	```text
	flowmo
	leworldmodel
	planet
	tdmpc2
	```

	All learned world models use the same route-aware CEM planner over their latent rollouts.
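
For orientation, a generic cross-entropy method (CEM) loop is sketched below. It is not the repository's route-aware planner: the cost function is a toy point-mass stand-in for a latent rollout, and the population size, elite count, and action bounds are illustrative assumptions.

```python
import numpy as np

GOAL = np.array([3.0, 1.0])

def toy_cost(actions):
    """Distance to GOAL after integrating velocity commands with dt = 0.5."""
    final_pos = 0.5 * actions.sum(axis=0)
    return float(np.linalg.norm(final_pos - GOAL))

def cem_plan(cost_fn, horizon, action_dim, iters=5, pop=64, n_elite=8, seed=0):
    """Refit a Gaussian over action sequences to the lowest-cost samples."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iters):
        noise = rng.standard_normal((pop, horizon, action_dim))
        samples = np.clip(mean + std * noise, -1.0, 1.0)
        costs = np.array([cost_fn(a) for a in samples])
        elite = samples[np.argsort(costs)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

plan = cem_plan(toy_cost, horizon=10, action_dim=2)
```

In a learned-WM planner, `cost_fn` would roll the candidate action sequence through the model's latent dynamics and score the predicted trajectory against the route.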

	Traditional non-WM controllers:

	```text
	pid_los_controller
	no_flow_los_controller
	current_estimator_los_controller
	oracle_flow_los_controller
	```

	Tasks:

	```text
	reach_target
	station_keeping
	waypoint_square
	waypoint_zigzag
	```

	Boats:

	```text
	twin
	triangle
	```

	Flow families:

	```text
	noflow
	uniform
	vortex_center
	double_gyre
	source_sink
	source_sink_pair
	gradient
	shear
	turbulent_patch
	random_fourier
	```
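
If every method is evaluated on every task, boat, and flow family (a full factorial grid, which the protocol implies but does not state), the evaluation cells can be enumerated as follows.

```python
from itertools import product

methods = [
    "flowmo", "leworldmodel", "planet", "tdmpc2",
    "pid_los_controller", "no_flow_los_controller",
    "current_estimator_los_controller", "oracle_flow_los_controller",
]
tasks = ["reach_target", "station_keeping", "waypoint_square", "waypoint_zigzag"]
boats = ["twin", "triangle"]
flows = [
    "noflow", "uniform", "vortex_center", "double_gyre", "source_sink",
    "source_sink_pair", "gradient", "shear", "turbulent_patch", "random_fourier",
]

grid = list(product(methods, tasks, boats, flows))
print(len(grid), "evaluation cells")
```

Under that assumption the grid has 8 x 4 x 2 x 10 = 640 cells, before multiplying by the number of episodes per cell.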

	Metrics:

	```text
	success rate
	final distance
	trajectory length over successful episodes
	control effort (`sum_t ||a_t||_2^2`) over successful episodes
	time to goal over successful episodes
	```
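
The conditioning on successful episodes can be made concrete with a small aggregation sketch. The episode record layout (`success`, `final_distance`, `steps`, `actions`, `path`) is hypothetical, and averaging final distance over all episodes is an assumption, since the protocol does not restrict that metric to successes.

```python
import math

def control_effort(actions):
    """Sum over time of the squared L2 norm of each action."""
    return sum(sum(u * u for u in a) for a in actions)

def path_length(path):
    """Total length of a piecewise-linear (x, y) trajectory."""
    return sum(math.dist(p, q) for p, q in zip(path, path[1:]))

def mean(values):
    return sum(values) / len(values) if values else float("nan")

def summarize(episodes):
    ok = [e for e in episodes if e["success"]]
    return {
        "success_rate": len(ok) / len(episodes),
        "final_distance": mean([e["final_distance"] for e in episodes]),
        "trajectory_length": mean([path_length(e["path"]) for e in ok]),
        "control_effort": mean([control_effort(e["actions"]) for e in ok]),
        "time_to_goal": mean([e["steps"] for e in ok]),
    }
```

Because three of the metrics average only over successes, they should be read together with the success rate: a method that succeeds only on easy episodes can look efficient on these columns.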

	## Required Outputs

	Training outputs:

	```text
	experiments/<method>/checkpoint/paper.pt
	experiments/<method>/checkpoint/paper_step_*.pt
	experiments/<method>/result/parameter_count.json
	experiments/<method>/result/paper_training.json
	experiments/<method>/result/paper_training_trace.jsonl
	```

	Evaluation outputs:

	```text
	experiments/reports/paper_prediction.json
	experiments/reports/paper_flowmo_latent_probes.json
	experiments/reports/paper_planning/*.json
	experiments/reports/paper_planning/gifs/*.gif
	experiments/reports/paper_report.md
	```
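
A quick completeness check over the paths listed above might look like the following sketch. `missing_outputs` is a hypothetical helper, not part of the repository, and it assumes the training outputs apply to the four learned-WM directories only.

```python
from pathlib import Path

LEARNED_METHODS = ["flowmo", "leworldmodel", "planet", "tdmpc2"]

TRAINING_PATTERNS = [
    "experiments/{method}/checkpoint/paper.pt",
    "experiments/{method}/checkpoint/paper_step_*.pt",
    "experiments/{method}/result/parameter_count.json",
    "experiments/{method}/result/paper_training.json",
    "experiments/{method}/result/paper_training_trace.jsonl",
]

EVALUATION_PATTERNS = [
    "experiments/reports/paper_prediction.json",
    "experiments/reports/paper_flowmo_latent_probes.json",
    "experiments/reports/paper_planning/*.json",
    "experiments/reports/paper_planning/gifs/*.gif",
    "experiments/reports/paper_report.md",
]

def missing_outputs(root="."):
    """Return every required pattern that matches no file under `root`."""
    root = Path(root)
    patterns = [p.format(method=m) for m in LEARNED_METHODS for p in TRAINING_PATTERNS]
    patterns += EVALUATION_PATTERNS
    # Path.glob handles both literal relative paths and wildcard patterns.
    return [p for p in patterns if not any(root.glob(p))]
```

Calling `missing_outputs()` from the repository root after a full pipeline run should return an empty list if all listed artifacts were produced.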

	## Commands

	Run the complete paper pipeline:

	```bash
	python -m experiments.run_paper_image_pipeline
	```

	Run stages separately:

	```bash
	python -m experiments.run_paper_image_pipeline --stages train
	python -m experiments.run_paper_image_pipeline --stages prediction
	python -m experiments.run_paper_image_pipeline --stages probe
	python -m experiments.run_paper_image_pipeline --stages planning
	python -m experiments.run_paper_image_pipeline --stages report
	```

	Run tests:

	```bash
	python -m pytest -q
	```