	---
	license: mit
	base_model:
	- hanshanxue/WorldStereo
	- Wan-AI/Wan2.1-I2V-14B-480P-Diffusers
	tags:
	- safetensors
	- fp8
	- quantized
	- video
	- image-to-video
	- wan
	- worldstereo
	pipeline_tag: image-to-video
	---

	# WorldStereo-memory-dmd-fp8_e4m3fn_scaled

	A scaled FP8 (e4m3fn) quantization of the `worldstereo-memory-dmd` variant
	of `hanshanxue/WorldStereo`. It cuts the weight footprint from about 35 GB
	to about 20 GB, so the 14B WorldStereo DiT stays resident on 24 GB consumer
	GPUs (3090, 4090) instead of needing partial-load streaming.

	| | bf16 source | fp8_e4m3fn_scaled (this repo) |
	|---|---|---|
	| Size on disk | 34.86 GB | 20.35 GB (-41.6%) |
	| Tensors | 1799 (all BF16) | 549 FP8 + 1800 BF16 |
	| Fits in 24 GB VRAM | No (needs partial_load) | Yes (resident) |
	| Quality | reference | typical fp8_scaled drop: <1% perceptible on short clips, faint color drift / micro-detail loss possible on long sequences ([refs](#quality)) |
	| Speed on Ampere (3090) | reference | same (no native fp8 matmul; weight upcast to bf16 per matmul) |
	| Speed on Ada/Hopper (4090, H100) | reference | >2x via `torch._scaled_mm` |

	## Quantization recipe (verbatim from [kijai/ComfyUI-WanVideoWrapper](https://github.com/kijai/ComfyUI-WanVideoWrapper))

	For each tensor `T` in the source state-dict:

	1. If `T`'s name contains any of these keywords -> stored unchanged in BF16:
	```
	norm, bias, time_in, time_, patch_embedding, img_emb, modulation,
	text_embedding, adapter, add, ref_conv, audio_proj
	```
	This keeps numerically-sensitive small tensors (norms, biases, modulation,
	I2V cross-attn projections) at full precision.

	2. If `T` is a `.weight` tensor of rank 2 (Linear weights only; Conv weights
	stay BF16) AND survived the exclusion check -> cast to fp8_e4m3fn with a
	per-tensor scale:
	```python
	FP8_MAX = 448.0
	scale = T.float().abs().amax() / FP8_MAX
	T_fp8 = (T.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
	```
	Stored as two tensors:
	- `<key>` (the fp8 weight)
	- `<key>.scale_weight` (bfloat16 scalar)

	3. A marker tensor `scaled_fp8` (bf16 zero) is added so loaders that check
	`"scaled_fp8" in state_dict` auto-detect the scaled-fp8 format.
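The three rules above can be read as a small routing function over tensor names. The following is a minimal sketch, assuming the Linear-only (rank-2) rule described under "What's in the file". The helper name `route_tensor` and the example keys are hypothetical and are not taken from `cook_worldstereo_fp8.py`.

```python
# Keywords copied from rule 1: any key containing one of them stays BF16.
EXCLUDE_KEYWORDS = (
    "norm", "bias", "time_in", "time_", "patch_embedding", "img_emb",
    "modulation", "text_embedding", "adapter", "add", "ref_conv", "audio_proj",
)


def route_tensor(key: str, rank: int) -> str:
    """Return "bf16" or "fp8" for one state-dict entry (hypothetical helper)."""
    if any(word in key for word in EXCLUDE_KEYWORDS):
        return "bf16"  # rule 1: plain substring match on the key
    if key.endswith(".weight") and rank == 2:
        return "fp8"   # rule 2: Linear weight, stored with a scale_weight
    return "bf16"      # everything else (e.g. Conv weights) is left untouched


# Example keys are illustrative, not read from the actual file.
print(route_tensor("blocks.0.attn1.to_q.weight", 2))
print(route_tensor("blocks.0.attn2.add_k_proj.weight", 2))
```

Because the check is a substring match, a short keyword such as `add` excludes every key that contains it, which is how the I2V `add_*` projections end up in BF16.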

	## Inference

	At runtime the scale must be re-applied. Two paths:

	### Ampere (3090) - upcast on matmul
	```python
	def fp8_scaled_linear(x, w_fp8, scale, bias):
	    # No native fp8 matmul here: upcast the weight, re-apply the scale, run a normal linear.
	    return torch.nn.functional.linear(x, w_fp8.to(x.dtype) * scale, bias)
	```

	### Ada/Hopper (4090, H100) - native fp8 matmul
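	```python
	def fp8_scaled_linear_fast(x, w_fp8, scale_w, bias, base_dtype=torch.bfloat16):
	    out_shape = (*x.shape[:-1], w_fp8.shape[0])
	    # Saturate to the e4m3fn range before casting the activations to fp8.
	    x = x.clamp(-448, 448).to(torch.float8_e4m3fn).contiguous()
	    scale_in = torch.ones((), device=x.device, dtype=torch.float32)
	    # _scaled_mm takes 2D operands and float32 scales; the stored scale is bf16.
	    out = torch._scaled_mm(x.view(-1, x.shape[-1]), w_fp8.t(),
	                           out_dtype=base_dtype, bias=bias,
	                           scale_a=scale_in, scale_b=scale_w.float())
	    return out.reshape(out_shape)
	```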

	Reference implementations:
	- `kijai/ComfyUI-WanVideoWrapper/fp8_optimization.py` (fast path)
	- `kijai/ComfyUI-WanVideoWrapper/nodes_model_loading.py` `_replace_linear` (slow path)
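Choosing between the two paths comes down to the GPU's compute capability. The sketch below is one way to express that choice; `pick_fp8_path` is a hypothetical helper, not a function from the reference implementations. It takes the capability as a plain tuple so it runs without a GPU.

```python
def pick_fp8_path(capability: tuple) -> str:
    """Return "scaled_mm" when the GPU has fp8 tensor cores, else "upcast"."""
    # Ada is compute capability 8.9 and Hopper is 9.0; Ampere (8.0 / 8.6) is lower.
    return "scaled_mm" if tuple(capability) >= (8, 9) else "upcast"


# On a real machine the tuple would come from torch.cuda.get_device_capability().
print(pick_fp8_path((8, 6)))  # RTX 3090
print(pick_fp8_path((8, 9)))  # RTX 4090
print(pick_fp8_path((9, 0)))  # H100
```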

	## Naming convention

	This file uses diffusers-style keys (`blocks.0.attn1.to_q.weight`,
	`blocks.0.ffn.net.0.proj.weight`), matching the original `hanshanxue/WorldStereo`
	source layout. This is different from Kijai's published Wan2.1 fp8 files which
	use native Wan keys (`blocks.0.self_attn.q.weight`). If your loader expects
	native names, run a key remap (see `diffusers_to_native_wan` in
	[ComfyUI-WorldStereo](https://github.com/pozzettiandrea/ComfyUI-WorldStereo)).
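A key remap of this kind is usually a table of substring substitutions. The sketch below is illustrative only: the first pair comes from the example keys above, the remaining pairs are assumptions about the usual Wan layout, and the table is incomplete. It is not copied from `diffusers_to_native_wan`, so use that function for real conversions.

```python
# (diffusers fragment, native Wan fragment). Only the first pair is documented
# in this README; the others are assumptions and the list is not exhaustive.
FRAGMENTS = [
    (".attn1.to_q.", ".self_attn.q."),
    (".attn1.to_k.", ".self_attn.k."),
    (".attn1.to_v.", ".self_attn.v."),
    (".attn1.to_out.0.", ".self_attn.o."),
    (".ffn.net.0.proj.", ".ffn.0."),
    (".ffn.net.2.", ".ffn.2."),
]


def remap_key(key: str) -> str:
    """Rewrite one diffusers-style key to its assumed native Wan name."""
    for old, new in FRAGMENTS:
        if old in key:
            return key.replace(old, new)
    return key  # unknown keys (e.g. the scaled_fp8 marker) pass through


def remap_state_dict(state_dict: dict) -> dict:
    """Apply remap_key to every key, keeping the tensors themselves as-is."""
    return {remap_key(k): v for k, v in state_dict.items()}
```

Because the substitution acts on the module path, a weight and its `scale_weight` companion are renamed together, which keeps the pair discoverable by loaders.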

	## What's in the file

	- 549 FP8 tensors (14.5 GB): rank-2 Linear weights only (self-attn
	  Q/K/V/O, FFN projections, controlnet linears). Conv weights stay BF16
	  because no scaled-fp8 Conv kernel exists in PyTorch / ComfyUI.
	- 1800 BF16 tensors (5.84 GB):
	  - 362 `norm` tensors (RMSNorm gains): 1.26 GB
	  - 80 `add_*` tensors (I2V cross-attention to image embeddings): 4.19 GB
	  - 3 `time_*` tensors (timestep MLP): 0.37 GB
	  - 759 `bias` + others: ~0.02 GB
	  - 549 `*.scale_weight` scalars (one per FP8 weight): 1.1 MB
	  - 1 marker tensor `scaled_fp8`
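The counts above can be checked without loading any weights, because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header listing each tensor's dtype, shape, and byte offsets. The sketch below reads only that header; `summarize_safetensors` is a hypothetical helper name, and the FP8 weights are expected to show up under the dtype string `F8_E4M3`.

```python
import json
import struct
from collections import defaultdict


def summarize_safetensors(path):
    """Return {dtype: (tensor_count, total_bytes)} from the file header only."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len))
    summary = defaultdict(lambda: [0, 0])
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata, not a tensor
            continue
        start, end = info["data_offsets"]
        summary[info["dtype"]][0] += 1
        summary[info["dtype"]][1] += end - start
    return {dtype: tuple(stats) for dtype, stats in summary.items()}
```

Usage would look like `summarize_safetensors("WorldStereo-memory-dmd-fp8_e4m3fn_scaled.safetensors")`, which returns one `(count, bytes)` pair per dtype.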

	## Quality

	This recipe (Kijai's fp8_e4m3fn_scaled) is the community standard for Wan2.1
	14B inference on 24 GB GPUs. Documented behavior:

	- Subjective: indistinguishable from bf16 on most short clips (Kijai's
	own statement: "never any difference remotely [from fp16]" in his
	per-tensor-scaled tests).
	- Known artifacts on long clips: faint color drift, occasional minor
	detail loss; usually not visible side-by-side without close inspection.
	- References:
	  - [Comfy-Org repackaged: fp8_e4m3fn vs fp8_scaled discussion](https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/discussions/22)
	  - [Kijai WanVideo_comfy_fp8_scaled artifacts discussion](https://huggingface.co/Kijai/WanVideo_comfy_fp8_scaled/discussions/31)
	- Round-trip on a sample 5120x5120 attention weight: max_abs_err = 0.005
	  (3.7% of the tensor's max value, ~standard for e4m3fn-scaled).
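The round-trip figure is close to what the format itself predicts. After scaling, the largest weight maps to 448, and in the top binade (256 to 448) e4m3fn values are 32 apart, so the worst-case rounding error is 16/448, about 3.6% of the tensor's max. The sketch below emulates e4m3fn rounding in pure Python to illustrate this on synthetic Gaussian weights. It is an approximation under stated assumptions: the scale is kept in float64 rather than bf16, and the weights are random rather than taken from the model.

```python
import math
import random

FP8_MAX = 448.0      # largest finite e4m3fn value
MIN_NORMAL_EXP = -6  # smallest normal exponent (exponent bias 7)
MANTISSA_BITS = 3


def round_to_e4m3fn(v: float) -> float:
    """Round v to the nearest e4m3fn value (ties to even), saturating at +-448."""
    if v == 0.0:
        return 0.0
    a = min(abs(v), FP8_MAX)
    # frexp gives a = m * 2**e with 0.5 <= m < 1, so floor(log2(a)) == e - 1.
    # Clamping the exponent makes subnormals share the spacing 2**-9.
    exp = max(math.frexp(a)[1] - 1, MIN_NORMAL_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)
    q = round(a / step) * step  # Python's round() resolves ties to even
    return math.copysign(min(q, FP8_MAX), v)


def roundtrip_max_err(weights):
    """Quantize with a per-tensor scale, dequantize, return (max_abs_err, amax)."""
    amax = max(abs(w) for w in weights)
    scale = amax / FP8_MAX
    max_err = max(abs(round_to_e4m3fn(w / scale) * scale - w) for w in weights)
    return max_err, amax


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]  # synthetic stand-in
max_err, amax = roundtrip_max_err(weights)
print(f"max_abs_err = {max_err:.5f} ({100 * max_err / amax:.2f}% of amax)")
```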

	If you observe quality regressions specific to WorldStereo's camera/scene
	conditioning, fall back to bf16 with partial-load streaming.

	## Provenance

	- Source: `hanshanxue/WorldStereo` (`worldstereo-memory-dmd` variant, commit `2adb716`)
	- Source base model: `Wan-AI/Wan2.1-I2V-14B-480P-Diffusers` (Apache-2.0)
	- Verified: WorldStereo's Wan base is frozen (sampled tensors differ
	from upstream only at fp16->bf16 ULPs). The novel content is the 1.76 GB
	controlnet branch under `controlnet.controlnet_blocks.*` -- WorldStereo's
	camera/scene-render conditioning.
	- Quantization tool: `cook_worldstereo_fp8.py` (in this repo)

	## License

	- WorldStereo overlay: MIT (inherited from `hanshanxue/WorldStereo`)
	- Wan2.1 base weights: Apache-2.0 (inherited from `Wan-AI/Wan2.1-I2V-14B-480P-Diffusers`)

	A NOTICE file in this repo provides Apache-2.0 attribution and a statement of
	changes per section 4 of the license.

	## Files

	- `WorldStereo-memory-dmd-fp8_e4m3fn_scaled.safetensors` - the quantized model
	- `cook_worldstereo_fp8.py` - the quantization script (reproducible)
	- `config.json` - the source `worldstereo-memory-dmd` config (carries the
	scale_map, base_model path, sampling settings)
	- `README.md` - this file
	- `NOTICE` - Apache-2.0 attribution
	- `LICENSE` - MIT (for the WorldStereo overlay portion)