	# LoRA Layer Inventory — Edit Anything checkpoints

	Inventory of every tensor in two builds of the
	`edit_anything_reference_v0.1_r128` LoRA.

	Both builds share the same canonical basename
	(`edit_anything_reference_v0.1_r128`) and are distinguished by the **extras
	suffix** that `scripts/split_editanything_lora.py` appends to the output
	filenames:

	- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.{standard,module}.safetensors`
	— the original build. Only ships `ref_adaln_proj` + `role_embedding`.
	- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.{standard,module}.safetensors`
	— the continuation, fine-tuned with the
	`video_to_video_ref_visual_adaln` strategy. Adds the `ref_attn` LoRA
	branch and the `ref_visual_proj` projector on top of the original
	extras.

	In the rest of this doc the two are referred to by their suffix only:
	- `ref_adaln_proj-role_embedding`
	- `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`

	Rank is 128 in both (encoded in the LoRA tensor shapes; no `alpha` keys saved).
	Dtype is `bfloat16` throughout. All LoRA modules cover 48 transformer blocks.
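Because no `alpha` keys are saved, the rank has to be read from the tensor shapes. The following is a minimal sketch, assuming the common PEFT-style layout in which `lora_A.weight` is `(rank, in_features)` and `lora_B.weight` is `(out_features, rank)`. The `infer_rank` helper and the `example_shapes` dict are hypothetical, and the 4096 feature width is a placeholder rather than a value taken from the checkpoints.

```python
def infer_rank(shapes):
    """Return the single LoRA rank implied by a {key: shape} mapping.

    Assumes lora_A.weight is (rank, in_features) and lora_B.weight is
    (out_features, rank). Raises ValueError if no rank or several ranks
    are found.
    """
    ranks = set()
    for key, shape in shapes.items():
        if key.endswith(".lora_A.weight"):
            ranks.add(shape[0])
        elif key.endswith(".lora_B.weight"):
            ranks.add(shape[1])
    if len(ranks) != 1:
        raise ValueError(f"expected exactly one rank, found {sorted(ranks)}")
    return ranks.pop()


# Illustrative shapes only; the 4096 feature width is a placeholder.
example_shapes = {
    "diffusion_model.transformer_blocks.0.attn1.to_q.lora_A.weight": (128, 4096),
    "diffusion_model.transformer_blocks.0.attn1.to_q.lora_B.weight": (4096, 128),
    "role_embedding.embedding.weight": (1, 128),
}
print(infer_rank(example_shapes))
```

In practice the `{key: shape}` mapping would come from whatever tool is used to read the safetensors header; non-LoRA keys are simply ignored.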

	---

	## 1. Summary

	| Property | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` |
	|---|---|---|
	| Total tensors | 965 | 1356 |
	| LoRA-target modules | 10 | 14 |
	| LoRA tensors (A+B) | 960 | 1344 |
	| Extra (non-LoRA) tensors | 5 | 12 |
	| `ref_attn` LoRA branch | ❌ absent | ✅ trained on 48 blocks |
	| `ref_visual_proj` (visual cross-attn projector) | ❌ absent | ✅ present (7 tensors) |
	| `ref_adaln_proj` (global appearance AdaLN) | ✅ (fc1 input dim 256) | ✅ (fc1 input dim 768) |
	| `role_embedding` | ✅ shape (1, 128) | ✅ shape (1, 128) |

	---

	## 2. LoRA adapters

	Each row is one target module type. A ✅ means the build stores a
	(`lora_A.weight`, `lora_B.weight`) pair for that module in each of the 48
	blocks under `diffusion_model.transformer_blocks.*`.

	| Module | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` | Notes |
	|---|:---:|:---:|---|
	| `attn1.to_q` | ✅ | ✅ | self-attention query |
	| `attn1.to_k` | ✅ | ✅ | self-attention key |
	| `attn1.to_v` | ✅ | ✅ | self-attention value |
	| `attn1.to_out.0` | ✅ | ✅ | self-attention output proj |
	| `attn2.to_q` | ✅ | ✅ | cross-attention to text (Gemma), query |
	| `attn2.to_k` | ✅ | ✅ | cross-attention key |
	| `attn2.to_v` | ✅ | ✅ | cross-attention value |
	| `attn2.to_out.0` | ✅ | ✅ | cross-attention output proj |
	| `ff.net.0.proj` | ✅ | ✅ | feed-forward up-projection |
	| `ff.net.2` | ✅ | ✅ | feed-forward down-projection |
	| `ref_attn.to_q` | ❌ | ✅ | new: visual reference cross-attention, query |
	| `ref_attn.to_k` | ❌ | ✅ | new: key |
	| `ref_attn.to_v` | ❌ | ✅ | new: value |
	| `ref_attn.to_out.0` | ❌ | ✅ | new: output proj |

	Key naming: `diffusion_model.transformer_blocks.{0..47}.{module}.{lora_A|lora_B}.weight`
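The key pattern above can be taken apart with a regular expression. This is a small sketch; `KEY_RE` and `parse_lora_key` are hypothetical helper names, not part of the training or inference code.

```python
import re

# Matches diffusion_model.transformer_blocks.<block>.<module>.<lora_A|lora_B>.weight
KEY_RE = re.compile(
    r"^diffusion_model\.transformer_blocks\.(?P<block>\d+)\."
    r"(?P<module>.+)\.(?P<part>lora_A|lora_B)\.weight$"
)


def parse_lora_key(key):
    """Return (block_index, module, part) for a LoRA key, or None otherwise."""
    match = KEY_RE.match(key)
    if match is None:
        return None
    return int(match.group("block")), match.group("module"), match.group("part")


print(parse_lora_key(
    "diffusion_model.transformer_blocks.47.attn1.to_out.0.lora_B.weight"
))
print(parse_lora_key("role_embedding.embedding.weight"))
```

Non-LoRA keys such as `role_embedding.embedding.weight` do not match and yield `None`, which makes the helper usable as a filter over a whole state dict.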

	Training freeze policy for the
	`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build
	(per `stage2_ref_visual_adaln_crossattn_from_v01_r128.yaml`):
	- `attn1.*` adapters are loaded from the `ref_adaln_proj-role_embedding` build
	but kept frozen (`trainable_include_patterns` excludes them).
	- `attn2.*`, `ff.*`, and `ref_attn.*` adapters are trainable.
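The freeze policy amounts to a pattern match on the module name. The sketch below assumes glob-style patterns; the actual value and matching semantics of `trainable_include_patterns` in the YAML are not reproduced in this doc, so `TRAINABLE_PATTERNS` and `is_trainable` are illustrative only.

```python
from fnmatch import fnmatch

# Hypothetical glob patterns mirroring the policy described above.
TRAINABLE_PATTERNS = ["attn2.*", "ff.*", "ref_attn.*"]


def is_trainable(module_name, patterns=TRAINABLE_PATTERNS):
    """True if the module name matches any trainable pattern."""
    return any(fnmatch(module_name, pattern) for pattern in patterns)


for name in ["attn1.to_q", "attn2.to_out.0", "ff.net.2", "ref_attn.to_k"]:
    print(name, "trainable" if is_trainable(name) else "frozen")
```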

	---

	## 3. Non-LoRA modules (the module sidecar)

	These tensors live at the top level of the state dict (no `transformer_blocks.*`
	prefix) and are consumed by the custom inference path
	(`LTXVEditAnythingModuleLoader` + `LTXVEditAnythingLoopingSampler`), not by the
	standard ComfyUI LoRA loader.

	### 3.1. `role_embedding` — appearance role bias

	| Key | Shape | Notes |
	|---|---|---|
	| `role_embedding.embedding.weight` | (1, 128) | 1 slot (appearance). Padded to (3, 128) at inference; entry stored at slot 1 (ref_img role). |

	Present in both builds with the same shape. In the
	`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build it is
	frozen (`use_visual_ref_role_embedding: false`); wandb shows its norm
	stays flat at ~0.125 throughout training.
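The padding step described in the table can be sketched in plain Python. Zero-filling the two unused slots is an assumption of this sketch; the doc only states that the stored row ends up at slot 1 of a (3, 128) table. `pad_role_embedding` is a hypothetical helper and the values are placeholders, not real weights.

```python
def pad_role_embedding(stored_row, num_roles=3, slot=1):
    """Place the single stored row at `slot` of a (num_roles, dim) table.

    The remaining rows are zero-filled, which is an assumption here.
    """
    dim = len(stored_row)
    table = [[0.0] * dim for _ in range(num_roles)]
    table[slot] = list(stored_row)
    return table


stored = [0.01] * 128  # placeholder values
table = pad_role_embedding(stored)
print(len(table), len(table[0]))
```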

	### 3.2. `ref_adaln_proj` — global AdaLN appearance anchor

	Two-layer MLP that pools the reference latent into a vector added to every
	block's AdaLN timestep bias.

	| Key | `ref_adaln_proj-role_embedding` shape | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` shape |
	|---|---|---|
	| `ref_adaln_proj.fc1.weight` | (512, 256) | (512, 768) |
	| `ref_adaln_proj.fc1.bias` | (512,) | (512,) |
	| `ref_adaln_proj.proj.weight` | (36864, 512) | (36864, 512) |
	| `ref_adaln_proj.proj.bias` | (36864,) | (36864,) |

	> ⚠️ Shape mismatch on `fc1.weight`.
	> The `ref_adaln_proj-role_embedding` build was trained with a 2-scale pool
	> (`avg_1x1 ‖ max_1x1` → 256-dim input).
	> The `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build was
	> trained with a 3-scale pool (`avg_1x1 ‖ avg_2x2 ‖ max_1x1` → 768-dim).
	> Because of this incompatibility the trainer reinitializes
	> `ref_adaln_proj` from scratch when continuing from
	> `ref_adaln_proj-role_embedding`; the AdaLN projector in the continuation
	> is not a fine-tune of the original one. The output dim 36864 = AdaLN
	> param count for the LTX-2 transformer (read at runtime via
	> `preprocessor.adaln.linear.out_features`).
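The two `fc1` input widths are consistent with a simple count: each pooling scale contributes channels × grid cells. The sketch below assumes 128 reference-latent channels and that an n×n pool flattens to n² cells per channel. Both assumptions fit the shapes listed above but are not checked against the training code, and `pooled_input_dim` is a hypothetical helper.

```python
def pooled_input_dim(channels, scales):
    """Sum channels * cells over pooling scales given as (name, grid) pairs.

    An n x n pooling grid contributes n * n cells per channel.
    """
    return sum(channels * grid * grid for _name, grid in scales)


CHANNELS = 128  # assumed reference-latent channel count

two_scale = [("avg", 1), ("max", 1)]
three_scale = [("avg", 1), ("avg", 2), ("max", 1)]

print(pooled_input_dim(CHANNELS, two_scale))    # 128 + 128
print(pooled_input_dim(CHANNELS, three_scale))  # 128 + 512 + 128
```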

	### 3.3. `ref_visual_proj` — visual cross-attention memory tokens

	Present in `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` only.
	Implemented by `SafeVisualRefProjector` (training file
	`video_to_video_ref_visual.py`), which produces the 32 visual memory tokens
	consumed by the new `ref_attn` branch.

	| Key | Shape | Notes |
	|---|---|---|
	| `ref_visual_proj.fc1.weight` | (1024, 384) | input 384 = 128 (local pooled) + 128 (global mean) + 128 (global std); xavier init, gain 0.1 |
	| `ref_visual_proj.fc1.bias` | (1024,) | |
	| `ref_visual_proj.proj.weight` | (4096, 1024) | maps to context_dim 4096; xavier init, gain 0.05 |
	| `ref_visual_proj.proj.bias` | (4096,) | |
	| `ref_visual_proj.norm.weight` | (4096,) | LayerNorm γ |
	| `ref_visual_proj.norm.bias` | (4096,) | LayerNorm β |
	| `ref_visual_proj.pos_embed` | (1, 32, 4096) | per-token learned positional bias |

	Forward (matches `SafeVisualRefProjector.forward`):
	```
	tokens = local ‖ global_mean ‖ global_std # [B, 32, 384]
	tokens = proj(silu(fc1(tokens))) # → [B, 32, 4096]
	tokens = LayerNorm(tokens)
	tokens = tokens + pos_embed[:, :32]
	return tokens * token_scale # training default 0.25
	```
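The forward pass above can be traced end to end with a dependency-free toy version. This is only a sketch under stated assumptions: it drops the batch dimension, uses tiny made-up sizes (3 → 4 → 2 instead of 384 → 1024 → 4096), assumes a LayerNorm epsilon of 1e-5, and is not the `SafeVisualRefProjector` code. All helper names and parameter values are hypothetical.

```python
import math


def linear(x, weight, bias):
    """y = W x + b for one token; weight is a list of (out) rows of length (in)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def silu(x):
    """Elementwise x * sigmoid(x)."""
    return [v / (1.0 + math.exp(-v)) for v in x]


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the feature axis, then apply scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv * g + b for v, g, b in zip(x, gamma, beta)]


def project_tokens(tokens, p, token_scale=0.25):
    """tokens: per-token feature lists (local, mean, std already concatenated)."""
    out = []
    for i, tok in enumerate(tokens):
        h = linear(silu(linear(tok, p["fc1_w"], p["fc1_b"])), p["proj_w"], p["proj_b"])
        h = layer_norm(h, p["norm_w"], p["norm_b"])
        h = [v + e for v, e in zip(h, p["pos_embed"][i])]  # per-token positional bias
        out.append([v * token_scale for v in h])
    return out


# Toy parameters: 2 tokens, 3 input features, 4 hidden units, 2 output features.
params = {
    "fc1_w": [[0.1 * (r + c + 1) for c in range(3)] for r in range(4)],
    "fc1_b": [0.0] * 4,
    "proj_w": [[0.05 * (r - c) for c in range(4)] for r in range(2)],
    "proj_b": [0.0] * 2,
    "norm_w": [1.0] * 2,
    "norm_b": [0.0] * 2,
    "pos_embed": [[0.0] * 2 for _ in range(2)],
}
tokens_out = project_tokens([[1.0, 0.5, -0.5], [0.2, 0.1, 0.3]], params)
print(len(tokens_out), len(tokens_out[0]))
```

With identity LayerNorm parameters and a zero `pos_embed`, each output token is mean-free and bounded by `token_scale`, which is a quick way to check that the normalization and final scaling are wired in the stated order.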

	Not present in `ref_adaln_proj-role_embedding` — this entire branch is new.

	---

	## 4. Total tensor counts (sanity check)

	### `ref_adaln_proj-role_embedding`
	```
	LoRA:            10 modules × 48 blocks × 2 (A,B) = 960
	ref_adaln_proj:  4 (fc1.{w,b}, proj.{w,b})        =   4
	role_embedding:  1                                =   1
	total                                             = 965 ✓
	```

	### `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`
	```
	LoRA:            14 modules × 48 blocks × 2 (A,B) = 1344
	ref_adaln_proj:  4                                =    4
	ref_visual_proj: 7                                =    7
	role_embedding:  1                                =    1
	total                                             = 1356 ✓
	```
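The same arithmetic as a small script, for re-checking the totals if a build gains or loses modules. `total_tensors` is a hypothetical helper; the counts are the ones listed in this section.

```python
BLOCKS = 48
TENSORS_PER_ADAPTER = 2  # lora_A.weight and lora_B.weight


def total_tensors(lora_modules, extras):
    """LoRA tensors across all blocks plus the non-LoRA extras."""
    return lora_modules * BLOCKS * TENSORS_PER_ADAPTER + sum(extras.values())


original = total_tensors(10, {"ref_adaln_proj": 4, "role_embedding": 1})
continuation = total_tensors(
    14, {"ref_adaln_proj": 4, "ref_visual_proj": 7, "role_embedding": 1}
)
print(original, continuation)
```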

	---

	## 5. Loading a checkpoint at inference

	Use `scripts/split_editanything_lora.py` to split each raw training
	checkpoint into:
	- `*.standard.safetensors` — LoRA on `attn1/attn2/ff` only; safe to feed to
	ComfyUI's standard LoraLoader.
	- `*.module.safetensors` — everything else (`role_embedding`,
	`ref_adaln_proj`, `ref_visual_proj`, `ref_attn` LoRA adapters); feed to
	`LTXVEditAnythingModuleLoader`.
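The routing rule described in the two bullets can be sketched as a key classifier. This is an illustration of the rule as stated here, not the logic of `scripts/split_editanything_lora.py`; `sidecar_for` and the constants are hypothetical names.

```python
BLOCK_PREFIX = "diffusion_model.transformer_blocks."
STANDARD_MODULES = ("attn1.", "attn2.", "ff.")


def sidecar_for(key):
    """Return "standard" for attn1/attn2/ff LoRA keys, "module" for the rest."""
    if key.startswith(BLOCK_PREFIX):
        # Drop the prefix and block index, keeping e.g. "attn1.to_q.lora_A.weight".
        _block, _sep, rest = key[len(BLOCK_PREFIX):].partition(".")
        if rest.startswith(STANDARD_MODULES):
            return "standard"
    return "module"


for key in [
    "diffusion_model.transformer_blocks.0.attn1.to_q.lora_A.weight",
    "diffusion_model.transformer_blocks.0.ref_attn.to_q.lora_A.weight",
    "ref_adaln_proj.fc1.weight",
]:
    print(sidecar_for(key), key)
```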

	The filename suffix lists every extra that ended up in the module sidecar,
	so it is obvious at a glance which mechanisms a given pair carries. Order is
	fixed: `ref_adaln_proj`, `role_embedding`, `ref_attn`, `ref_visual_proj`.
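The suffix rule can be sketched as follows. It assumes an extra counts as present when any key either starts with its name or contains it as a dotted path component; the real script may detect extras differently. `extras_present` and `output_names` are hypothetical helpers.

```python
EXTRAS_ORDER = ("ref_adaln_proj", "role_embedding", "ref_attn", "ref_visual_proj")


def extras_present(keys):
    """Return the extras found among `keys`, in the fixed suffix order."""
    found = []
    for extra in EXTRAS_ORDER:
        if any(k.startswith(extra + ".") or f".{extra}." in k for k in keys):
            found.append(extra)
    return found


def output_names(basename, keys):
    """Build the standard/module output filenames for a checkpoint."""
    suffix = "-".join(extras_present(keys))
    stem = f"{basename}_{suffix}" if suffix else basename
    return [f"{stem}.{kind}.safetensors" for kind in ("standard", "module")]


example_keys = [
    "diffusion_model.transformer_blocks.0.attn1.to_q.lora_A.weight",
    "diffusion_model.transformer_blocks.0.ref_attn.to_q.lora_A.weight",
    "ref_adaln_proj.fc1.weight",
    "ref_visual_proj.pos_embed",
    "role_embedding.embedding.weight",
]
for name in output_names("edit_anything_reference_v0.1_r128", example_keys):
    print(name)
```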

	### Canonical output names

	```
	edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors
	edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors

	edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors
	edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors
	```

	### Command

	```bash
	python3 /data/training/ltx-edit-trainer/scripts/split_editanything_lora.py \
	<raw-checkpoint>.safetensors --output-dir <dir> [--overwrite]
	```