LoRA Layer Inventory — Edit Anything checkpoints

Inventory of every tensor in two builds of the edit_anything_reference_v0.1_r128 LoRA.

Both builds share the same canonical basename (edit_anything_reference_v0.1_r128) and are distinguished by the extras suffix that scripts/split_editanything_lora.py appends to the output filenames:

edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.{standard,module}.safetensors: the original build. It ships only the ref_adaln_proj and role_embedding extras.
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.{standard,module}.safetensors: the continuation, fine-tuned with the video_to_video_ref_visual_adaln strategy. It adds the ref_attn LoRA branch and the ref_visual_proj projector on top of the original extras.

In the rest of this doc the two are referred to by their suffix only:

ref_adaln_proj-role_embedding
ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj

The rank is 128 in both builds; it is encoded in the LoRA tensor shapes, since no alpha keys are saved. The dtype is bfloat16 throughout, and every LoRA module covers all 48 transformer blocks.
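Since no alpha keys are saved, the rank has to be recovered from the tensor shapes. The following is a minimal sketch of that lookup, assuming the common LoRA layout in which lora_A.weight is (rank, in_features) and lora_B.weight is (out_features, rank). The `infer_lora_ranks` helper and the 4096 feature dimensions in the example are hypothetical; in practice the `{key: shape}` mapping would come from whatever safetensors reader is in use.

```python
from typing import Dict, Set, Tuple


def infer_lora_ranks(shapes: Dict[str, Tuple[int, ...]]) -> Set[int]:
    """Collect the rank implied by every LoRA tensor in a {key: shape} mapping.

    Assumes lora_A.weight is (rank, in_features) and
    lora_B.weight is (out_features, rank).
    """
    ranks = set()
    for key, shape in shapes.items():
        if key.endswith(".lora_A.weight"):
            ranks.add(shape[0])
        elif key.endswith(".lora_B.weight"):
            ranks.add(shape[1])
    return ranks


# Hypothetical shapes for one module; the 4096 feature dims are illustrative only.
example = {
    "diffusion_model.transformer_blocks.0.attn1.to_q.lora_A.weight": (128, 4096),
    "diffusion_model.transformer_blocks.0.attn1.to_q.lora_B.weight": (4096, 128),
    "role_embedding.embedding.weight": (1, 128),  # not a LoRA tensor, ignored
}
ranks = infer_lora_ranks(example)
```

A result containing more than one value would indicate mixed ranks, which neither build described here should have.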

1. Summary

| Property | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` |
|---|---|---|
| Total tensors | 965 | 1356 |
| LoRA-target modules | 10 | 14 |
| LoRA tensors (A+B) | 960 | 1344 |
| Extra (non-LoRA) tensors | 5 | 12 |
| `ref_attn` LoRA branch | ❌ absent | ✅ trained on 48 blocks |
| `ref_visual_proj` (visual cross-attn projector) | ❌ absent | ✅ present (7 tensors) |
| `ref_adaln_proj` (global appearance AdaLN) | ✅ (fc1 input dim 256) | ✅ (fc1 input dim 768) |
| `role_embedding` | ✅ shape (1, 128) | ✅ shape (1, 128) |

2. LoRA adapters

Each row is one target module type. A ✅ means the build contains a (lora_A.weight, lora_B.weight) pair for that module in each of the 48 blocks under diffusion_model.transformer_blocks.*.

| Module | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` | Notes |
|---|---|---|---|
| `attn1.to_q` | ✅ | ✅ | self-attention query |
| `attn1.to_k` | ✅ | ✅ | self-attention key |
| `attn1.to_v` | ✅ | ✅ | self-attention value |
| `attn1.to_out.0` | ✅ | ✅ | self-attention output proj |
| `attn2.to_q` | ✅ | ✅ | cross-attention to text (Gemma) |
| `attn2.to_k` | ✅ | ✅ | |
| `attn2.to_v` | ✅ | ✅ | |
| `attn2.to_out.0` | ✅ | ✅ | |
| `ff.net.0.proj` | ✅ | ✅ | feed-forward up-projection |
| `ff.net.2` | ✅ | ✅ | feed-forward down-projection |
| `ref_attn.to_q` | absent | ✅ | new: visual reference cross-attention |
| `ref_attn.to_k` | absent | ✅ | new |
| `ref_attn.to_v` | absent | ✅ | new |
| `ref_attn.to_out.0` | absent | ✅ | new |

Key naming: diffusion_model.transformer_blocks.{0..47}.{module}.{lora_A|lora_B}.weight
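The naming pattern can be taken apart mechanically, which is handy when grouping tensors by module. Below is a small sketch; `parse_lora_key` and `tensors_per_module` are illustrative helper names, and the key list is synthesized from the pattern rather than read from a real checkpoint.

```python
import re
from collections import Counter

KEY_RE = re.compile(
    r"^diffusion_model\.transformer_blocks\.(?P<block>\d+)\."
    r"(?P<module>.+)\.(?P<part>lora_A|lora_B)\.weight$"
)


def parse_lora_key(key):
    """Return (block_index, module, part) for a LoRA key, or None for other keys."""
    m = KEY_RE.match(key)
    if m is None:
        return None
    return int(m["block"]), m["module"], m["part"]


def tensors_per_module(keys):
    """Count LoRA tensors per module type, ignoring non-LoRA keys."""
    counts = Counter()
    for key in keys:
        parsed = parse_lora_key(key)
        if parsed is not None:
            counts[parsed[1]] += 1
    return counts


# Synthetic key list built from the naming pattern (not read from a real file).
modules = ["attn1.to_q", "attn1.to_out.0", "ff.net.0.proj"]
keys = [
    f"diffusion_model.transformer_blocks.{b}.{m}.{p}.weight"
    for b in range(48)
    for m in modules
    for p in ("lora_A", "lora_B")
]
keys.append("role_embedding.embedding.weight")  # sidecar key, should be skipped
counts = tensors_per_module(keys)
```

With full coverage, each module type should count 96 tensors (48 blocks × 2).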

Training freeze policy for the ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj build (per stage2_ref_visual_adaln_crossattn_from_v01_r128.yaml):

attn1.* adapters are loaded from the ref_adaln_proj-role_embedding build and kept frozen (trainable_include_patterns excludes them).
attn2.*, ff.*, ref_attn.* are trainable.
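The effect of an include-pattern list on the LoRA adapters can be illustrated as follows. This is only a sketch: the glob syntax, the pattern values, and the `is_trainable` helper are assumptions made for illustration, and the actual contents and matching rules of trainable_include_patterns are defined by the YAML config and the trainer.

```python
from fnmatch import fnmatchcase

# Hypothetical glob patterns mirroring the policy described above; the real
# values live in the YAML config and may use a different syntax.
TRAINABLE_INCLUDE_PATTERNS = ["*.attn2.*", "*.ff.*", "*.ref_attn.*"]


def is_trainable(key, patterns=TRAINABLE_INCLUDE_PATTERNS):
    """True if the parameter name matches at least one include pattern."""
    return any(fnmatchcase(key, pattern) for pattern in patterns)


prefix = "diffusion_model.transformer_blocks.0."
attn1_trainable = is_trainable(prefix + "attn1.to_q.lora_A.weight")
ref_attn_trainable = is_trainable(prefix + "ref_attn.to_k.lora_B.weight")
ff_trainable = is_trainable(prefix + "ff.net.0.proj.lora_A.weight")
```

Under these assumed patterns, attn1 adapters fall outside every include pattern and therefore stay frozen, while attn2, ff and ref_attn adapters are picked up.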

3. Non-LoRA modules (the module sidecar)

These tensors live at the top of the state dict (no transformer_blocks.* prefix) and are consumed by the custom inference path (LTXVEditAnythingModuleLoader + LTXVEditAnythingLoopingSampler), not by the standard ComfyUI LoRA loader.

3.1. `role_embedding` — appearance role bias

| Key | Shape | Notes |
|---|---|---|
| `role_embedding.embedding.weight` | (1, 128) | 1 slot (appearance). Padded to (3, 128) at inference; entry stored at slot 1 (ref_img role). |

Present in both builds with the same shape. In the ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj build it is frozen (use_visual_ref_role_embedding: false); wandb shows its norm stays flat at ~0.125 throughout training.
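The padding step from (1, 128) to (3, 128) might look like the sketch below, written with plain lists to stay dependency-free. The zero fill for the unused slots is an assumption (the table above only gives the padded shape and the target slot), and `pad_role_embedding` is a hypothetical helper, not the loader's actual code.

```python
def pad_role_embedding(weight, num_roles=3, slot=1):
    """Place the single stored row into a (num_roles, dim) table.

    Assumption: unused slots are zero-filled; only the padded shape and the
    target slot are documented.
    """
    dim = len(weight[0])
    table = [[0.0] * dim for _ in range(num_roles)]
    table[slot] = list(weight[0])
    return table


# Stand-in for role_embedding.embedding.weight with shape (1, 128).
stored = [[0.01] * 128]
padded = pad_role_embedding(stored)
```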

3.2. `ref_adaln_proj` — global AdaLN appearance anchor

Two-layer MLP that pools the reference latent into a vector added to every block's AdaLN timestep bias.

| Key | `ref_adaln_proj-role_embedding` shape | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` shape |
|---|---|---|
| `ref_adaln_proj.fc1.weight` | (512, 256) | (512, 768) |
| `ref_adaln_proj.fc1.bias` | (512,) | (512,) |
| `ref_adaln_proj.proj.weight` | (36864, 512) | (36864, 512) |
| `ref_adaln_proj.proj.bias` | (36864,) | (36864,) |

⚠️ Shape mismatch on fc1.weight. The ref_adaln_proj-role_embedding build was trained with a 2-scale pool (avg_1x1 ‖ max_1x1 → 256-dim input). The ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj build was trained with a 3-scale pool (avg_1x1 ‖ avg_2x2 ‖ max_1x1 → 768-dim). Because of this incompatibility the trainer reinitializes ref_adaln_proj from scratch when continuing from ref_adaln_proj-role_embedding; the AdaLN projector in the continuation is not a fine-tune of the original one. The output dim 36864 = AdaLN param count for the LTX-2 transformer (read at runtime via preprocessor.adaln.linear.out_features).
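The two input widths are consistent with a 128-channel latent in which every pooled grid cell contributes one 128-dim vector: 128 + 128 = 256 for the 2-scale pool, and 128 + 4 × 128 + 128 = 768 for the 3-scale pool. That reading is an inference from the numbers, not something stated by the checkpoint. The sketch below reproduces the arithmetic and shows a generic shape comparison that would flag fc1.weight as the only tensor blocking a warm start; `pooled_dim` and `shape_mismatches` are hypothetical helpers.

```python
def pooled_dim(channels, pools):
    """Feature size after concatenating pools given as (kind, grid) pairs.

    Each pool contributes channels * grid * grid features.
    """
    return sum(channels * grid * grid for _kind, grid in pools)


CHANNELS = 128  # assumption: latent channel count, inferred from 256 = 2 * 128

two_scale = [("avg", 1), ("max", 1)]
three_scale = [("avg", 1), ("avg", 2), ("max", 1)]


def shape_mismatches(src, dst):
    """Keys present in both {key: shape} mappings whose shapes differ."""
    return sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k])


original = {
    "ref_adaln_proj.fc1.weight": (512, 256),
    "ref_adaln_proj.proj.weight": (36864, 512),
}
continuation = {
    "ref_adaln_proj.fc1.weight": (512, 768),
    "ref_adaln_proj.proj.weight": (36864, 512),
}
mismatched = shape_mismatches(original, continuation)
```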

3.3. `ref_visual_proj` — visual cross-attention memory tokens

Present in ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj only. Implemented by SafeVisualRefProjector (training file video_to_video_ref_visual.py), it produces 32 visual memory tokens that are consumed by the new ref_attn branch.

| Key | Shape | Notes |
|---|---|---|
| `ref_visual_proj.fc1.weight` | (1024, 384) | input 384 = 128 (local pooled) + 128 (global mean) + 128 (global std) |
| `ref_visual_proj.fc1.bias` | (1024,) | xavier init gain 0.1 |
| `ref_visual_proj.proj.weight` | (4096, 1024) | maps to context_dim 4096; xavier init gain 0.05 |
| `ref_visual_proj.proj.bias` | (4096,) | |
| `ref_visual_proj.norm.weight` | (4096,) | LayerNorm γ |
| `ref_visual_proj.norm.bias` | (4096,) | LayerNorm β |
| `ref_visual_proj.pos_embed` | (1, 32, 4096) | per-token learned positional bias |

Forward (matches SafeVisualRefProjector.forward):

tokens = local ‖ global_mean ‖ global_std          # [B, 32, 384]
tokens = proj(silu(fc1(tokens)))                   # → [B, 32, 4096]
tokens = LayerNorm(tokens)
tokens = tokens + pos_embed[:, :32]
return tokens * token_scale                        # training default 0.25

Not present in ref_adaln_proj-role_embedding — this entire branch is new.
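As a cross-check of the forward pass against the stored shapes, the sketch below follows tensor shapes through each step and raises if the weights would not chain together. It does no numerical work and is not the SafeVisualRefProjector code; `trace_forward_shapes` and its `feature_dim` parameter are hypothetical names.

```python
REF_VISUAL_PROJ_SHAPES = {
    "ref_visual_proj.fc1.weight": (1024, 384),
    "ref_visual_proj.fc1.bias": (1024,),
    "ref_visual_proj.proj.weight": (4096, 1024),
    "ref_visual_proj.proj.bias": (4096,),
    "ref_visual_proj.norm.weight": (4096,),
    "ref_visual_proj.norm.bias": (4096,),
    "ref_visual_proj.pos_embed": (1, 32, 4096),
}


def trace_forward_shapes(shapes, batch=1, feature_dim=128):
    """Return the activation shape after each forward step, or raise on a mismatch."""
    hidden, in_dim = shapes["ref_visual_proj.fc1.weight"]
    context_dim, proj_in = shapes["ref_visual_proj.proj.weight"]
    _, num_tokens, pos_dim = shapes["ref_visual_proj.pos_embed"]

    # local, global_mean and global_std are concatenated on the last axis.
    if in_dim != 3 * feature_dim:
        raise ValueError(f"fc1 expects {in_dim} inputs, concat gives {3 * feature_dim}")
    if proj_in != hidden:
        raise ValueError("proj input does not match fc1 output")
    if shapes["ref_visual_proj.norm.weight"] != (context_dim,) or pos_dim != context_dim:
        raise ValueError("norm or pos_embed width does not match context_dim")

    return [
        ("concat", (batch, num_tokens, in_dim)),
        ("silu(fc1)", (batch, num_tokens, hidden)),
        ("proj", (batch, num_tokens, context_dim)),
        ("norm + pos_embed, scaled", (batch, num_tokens, context_dim)),
    ]


trace = trace_forward_shapes(REF_VISUAL_PROJ_SHAPES, batch=2)
```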

4. Total tensor counts (sanity check)

`ref_adaln_proj-role_embedding`

LoRA: 10 modules × 48 blocks × 2 (A,B)            = 960
ref_adaln_proj: 4 (fc1.{w,b}, proj.{w,b})         =   4
role_embedding: 1                                 =   1
                                              total= 965 ✓

`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`

LoRA: 14 modules × 48 blocks × 2 (A,B)            = 1344
ref_adaln_proj: 4                                 =    4
ref_visual_proj: 7                                =    7
role_embedding: 1                                 =    1
                                              total= 1356 ✓
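The same arithmetic can be scripted so that it is easy to compare against the number of keys in an actual file. This is a minimal sketch; `total_tensors` is an illustrative helper, and the per-extra counts are the ones listed in the breakdowns above.

```python
BLOCKS = 48
TENSORS_PER_ADAPTER = 2  # lora_A.weight and lora_B.weight


def total_tensors(num_lora_modules, extras):
    """Expected tensor count: LoRA pairs across all blocks plus sidecar tensors."""
    lora = num_lora_modules * BLOCKS * TENSORS_PER_ADAPTER
    return lora + sum(extras.values())


original_total = total_tensors(10, {"ref_adaln_proj": 4, "role_embedding": 1})
continuation_total = total_tensors(
    14, {"ref_adaln_proj": 4, "ref_visual_proj": 7, "role_embedding": 1}
)
```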

5. Loading checkpoints at inference

Use scripts/split_editanything_lora.py to split each raw training checkpoint into:

*.standard.safetensors — LoRA on attn1/attn2/ff only; safe to feed to ComfyUI's standard LoraLoader.
*.module.safetensors — everything else (role_embedding, ref_adaln_proj, ref_visual_proj, ref_attn LoRA adapters); feed to LTXVEditAnythingModuleLoader.
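The split rule above can be expressed as a per-key routing decision. The sketch below illustrates that rule only and is not the logic of scripts/split_editanything_lora.py, whose internals are not shown here; `route_key` is a hypothetical helper.

```python
BLOCK_PREFIX = "diffusion_model.transformer_blocks."
STANDARD_MODULE_PREFIXES = ("attn1.", "attn2.", "ff.")


def route_key(key):
    """Return 'standard' or 'module' for a checkpoint key."""
    if key.startswith(BLOCK_PREFIX):
        # Drop the prefix and the block index, leaving e.g. "attn1.to_q.lora_A.weight".
        _, _, rest = key[len(BLOCK_PREFIX):].partition(".")
        if rest.startswith(STANDARD_MODULE_PREFIXES):
            return "standard"
    # ref_attn adapters and all top-level tensors go to the module sidecar.
    return "module"


attn1_dest = route_key("diffusion_model.transformer_blocks.5.attn1.to_q.lora_A.weight")
ref_attn_dest = route_key("diffusion_model.transformer_blocks.5.ref_attn.to_q.lora_A.weight")
role_dest = route_key("role_embedding.embedding.weight")
```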

The filename suffix lists every extra that ended up in the module sidecar, so it is obvious at a glance which mechanisms a given pair carries. Order is fixed: ref_adaln_proj, role_embedding, ref_attn, ref_visual_proj.

Canonical output names

edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors

edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors
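These names follow directly from the basename plus the fixed-order extras suffix. The sketch below rebuilds them for illustration; `output_names` is a hypothetical helper and not necessarily how the split script constructs its filenames.

```python
EXTRAS_ORDER = ("ref_adaln_proj", "role_embedding", "ref_attn", "ref_visual_proj")


def output_names(basename, extras_present):
    """Build the (standard, module) filenames for a set of extras."""
    # Emit extras in the fixed order, regardless of how the set was built.
    suffix = "-".join(e for e in EXTRAS_ORDER if e in extras_present)
    stem = f"{basename}_{suffix}"
    return f"{stem}.standard.safetensors", f"{stem}.module.safetensors"


BASENAME = "edit_anything_reference_v0.1_r128"
original_names = output_names(BASENAME, {"role_embedding", "ref_adaln_proj"})
continuation_names = output_names(
    BASENAME, {"ref_visual_proj", "ref_attn", "role_embedding", "ref_adaln_proj"}
)
```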

Command

python3 /data/training/ltx-edit-trainer/scripts/split_editanything_lora.py \
  <raw-checkpoint>.safetensors --output-dir <dir> [--overwrite]