
LoRA Layer Inventory: Edit Anything checkpoints

Inventory of every tensor in two builds of the edit_anything_reference_v0.1_r128 LoRA.

Both builds share the same canonical basename (edit_anything_reference_v0.1_r128) and are distinguished by the extras suffix that scripts/split_editanything_lora.py appends to the output filenames:

  • edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.{standard,module}.safetensors: the original build. Ships only the ref_adaln_proj and role_embedding extras.
  • edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.{standard,module}.safetensors: the continuation, fine-tuned with the video_to_video_ref_visual_adaln strategy. Adds the ref_attn LoRA branch and the ref_visual_proj projector on top of the original extras.

In the rest of this doc the two are referred to by their suffix only:

  • ref_adaln_proj-role_embedding
  • ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj

Rank is 128 in both builds (it is encoded in the LoRA tensor shapes; no alpha keys are saved). Dtype is bfloat16 throughout. Every LoRA target module is adapted in all 48 transformer blocks.
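The rank, dtype, and block-count claims above can be checked without loading any tensor data, because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header listing each tensor's dtype and shape. The following is a minimal sketch using only the standard library. The function names are illustrative, and it assumes that lora_A.weight is stored as (rank, in_features) and that alpha keys, if any existed, would end in ".alpha".

```python
import json
import struct
from collections import Counter


def parse_safetensors_header(buf):
    """Parse the JSON header from the leading bytes of a safetensors file."""
    (header_len,) = struct.unpack("<Q", buf[:8])
    header = json.loads(buf[8:8 + header_len].decode("utf-8"))
    header.pop("__metadata__", None)  # optional metadata entry, not a tensor
    return header


def read_safetensors_header(path):
    """Read only the header of the file at `path` (no tensor data is loaded)."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        (header_len,) = struct.unpack("<Q", prefix)
        return parse_safetensors_header(prefix + f.read(header_len))


def summarize(header):
    """Collect tensor counts, LoRA ranks, dtypes, and the number of blocks."""
    lora_a = {k: v for k, v in header.items() if k.endswith(".lora_A.weight")}
    lora_b = [k for k in header if k.endswith(".lora_B.weight")]
    # Assumption: lora_A.weight has shape (rank, in_features).
    ranks = {v["shape"][0] for v in lora_a.values()}
    blocks = {
        int(k.split("transformer_blocks.", 1)[1].split(".", 1)[0])
        for k in lora_a
        if "transformer_blocks." in k
    }
    return {
        "total": len(header),
        "lora": len(lora_a) + len(lora_b),
        "extra": len(header) - len(lora_a) - len(lora_b),
        "ranks": sorted(ranks),
        "dtypes": dict(Counter(v["dtype"] for v in header.values())),
        "num_blocks": len(blocks),
        # Assumption: alpha tensors, if present, would use a ".alpha" suffix.
        "has_alpha": any(k.endswith(".alpha") for k in header),
    }
```

Run against a raw checkpoint of the original build, summarize(read_safetensors_header(path)) would be expected to report 965 total tensors, 960 LoRA tensors, 5 extras, ranks [128], and 48 blocks, per the figures in this document.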


1. Summary

| | ref_adaln_proj-role_embedding | ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj |
| --- | --- | --- |
| Total tensors | 965 | 1356 |
| LoRA-target modules | 10 | 14 |
| LoRA tensors (A+B) | 960 | 1344 |
| Extra (non-LoRA) tensors | 5 | 12 |
| ref_attn LoRA branch | ❌ absent | ✅ trained on 48 blocks |
| ref_visual_proj (visual cross-attn projector) | ❌ absent | ✅ present (7 tensors) |
| ref_adaln_proj (global appearance AdaLN) | ✅ (fc1 input dim 256) | ✅ (fc1 input dim 768) |
| role_embedding | ✅ shape (1, 128) | ✅ shape (1, 128) |

2. LoRA adapters

Each row is one target module type. Each ✅ stands for a (lora_A.weight, lora_B.weight) pair repeated in each of the 48 blocks of diffusion_model.transformer_blocks.*.

| Module | ref_adaln_proj-role_embedding | ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj | Notes |
| --- | --- | --- | --- |
| attn1.to_q | ✅ | ✅ | self-attention query |
| attn1.to_k | ✅ | ✅ | self-attention key |
| attn1.to_v | ✅ | ✅ | self-attention value |
| attn1.to_out.0 | ✅ | ✅ | self-attention output proj |
| attn2.to_q | ✅ | ✅ | cross-attention to text (Gemma) |
| attn2.to_k | ✅ | ✅ | |
| attn2.to_v | ✅ | ✅ | |
| attn2.to_out.0 | ✅ | ✅ | |
| ff.net.0.proj | ✅ | ✅ | feed-forward up-projection |
| ff.net.2 | ✅ | ✅ | feed-forward down-projection |
| ref_attn.to_q | ❌ | ✅ | new: visual reference cross-attention |
| ref_attn.to_k | ❌ | ✅ | new |
| ref_attn.to_v | ❌ | ✅ | new |
| ref_attn.to_out.0 | ❌ | ✅ | new |

Key naming: diffusion_model.transformer_blocks.{0..47}.{module}.{lora_A|lora_B}.weight
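The key pattern above can be turned into a small parser to confirm that every LoRA module has both an A and a B tensor in each block. This is a minimal sketch: the function names are illustrative and the regular expression simply mirrors the naming scheme stated above.

```python
import re
from collections import defaultdict

# Mirrors: diffusion_model.transformer_blocks.{idx}.{module}.{lora_A|lora_B}.weight
LORA_KEY_RE = re.compile(
    r"^diffusion_model\.transformer_blocks\.(?P<block>\d+)\."
    r"(?P<module>.+)\.(?P<part>lora_A|lora_B)\.weight$"
)


def parse_lora_key(key):
    """Return (block_index, module_name, part) or None for non-LoRA keys."""
    m = LORA_KEY_RE.match(key)
    if m is None:
        return None
    return int(m.group("block")), m.group("module"), m.group("part")


def modules_missing_blocks(keys, num_blocks=48):
    """Map each LoRA module to the (block, part) pairs it lacks; {} means full coverage."""
    seen = defaultdict(set)
    for key in keys:
        parsed = parse_lora_key(key)
        if parsed is not None:
            block, module, part = parsed
            seen[module].add((block, part))
    expected = {(b, p) for b in range(num_blocks) for p in ("lora_A", "lora_B")}
    return {m: sorted(expected - got) for m, got in seen.items() if got != expected}
```

An empty result from modules_missing_blocks means every module found in the key list covers all 48 blocks with both tensors.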

Training freeze policy for the ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj build (per stage2_ref_visual_adaln_crossattn_from_v01_r128.yaml):

  • attn1.* adapters are loaded from the ref_adaln_proj-role_embedding build but kept frozen (trainable_include_patterns excludes them).
  • attn2.*, ff.*, ref_attn.* are trainable.
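The freeze policy for the LoRA adapters can be pictured as a pattern filter over parameter names. The sketch below is illustrative only: the pattern strings are hypothetical stand-ins (the actual trainable_include_patterns values in the YAML are not reproduced in this document), regex search semantics are an assumption, and the sketch covers LoRA adapter keys only, not the non-LoRA modules.

```python
import re

# Hypothetical patterns; the real values live in the training YAML.
TRAINABLE_INCLUDE_PATTERNS = [r"\.attn2\.", r"\.ff\.", r"\.ref_attn\."]


def is_trainable(key, patterns=TRAINABLE_INCLUDE_PATTERNS):
    """A LoRA key is trainable if any include pattern matches it (assumed semantics)."""
    return any(re.search(p, key) for p in patterns)


def split_by_policy(keys, patterns=TRAINABLE_INCLUDE_PATTERNS):
    """Partition LoRA keys into (trainable, frozen) lists, preserving order."""
    trainable = [k for k in keys if is_trainable(k, patterns)]
    frozen = [k for k in keys if not is_trainable(k, patterns)]
    return trainable, frozen
```

With these stand-in patterns, attn1.* keys match nothing and land in the frozen list, while attn2.*, ff.*, and ref_attn.* keys are trainable, which matches the policy described above.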

3. Non-LoRA modules (the module sidecar)

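These tensors live at the top level of the state dict (no transformer_blocks.* prefix) and are consumed by the custom inference path (LTXVEditAnythingModuleLoader + LTXVEditAnythingLoopingSampler), not by the standard ComfyUI LoRA loader. In the continuation build the module sidecar file also carries the ref_attn LoRA adapters (see Section 5); those are inventoried in Section 2 rather than here.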

3.1. role_embedding: appearance role bias

| Key | Shape | Notes |
| --- | --- | --- |
| role_embedding.embedding.weight | (1, 128) | 1 slot (appearance). Padded to (3, 128) at inference; entry stored at slot 1 (ref_img role). |

Present in both builds with the same shape. In the ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj build it is frozen (use_visual_ref_role_embedding: false); wandb shows its norm stays flat at ~0.125 throughout training.
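The inference-time padding from (1, 128) to (3, 128) can be sketched as follows. This is an illustration, not the loader's code: the function name is hypothetical, plain nested lists stand in for tensors, and zero-filling the unused slots is an assumption (the document only states the padded shape and that the stored entry goes to slot 1).

```python
def pad_role_embedding(weight, num_roles=3, slot=1):
    """Place the single stored row at `slot` of a (num_roles, dim) table.

    `weight` is a nested list of shape (1, dim). Unused slots are zero-filled,
    which is an assumption of this sketch.
    """
    if len(weight) != 1:
        raise ValueError("expected a single stored role row, got %d" % len(weight))
    if not 0 <= slot < num_roles:
        raise ValueError("slot %d is outside [0, %d)" % (slot, num_roles))
    dim = len(weight[0])
    padded = [[0.0] * dim for _ in range(num_roles)]
    padded[slot] = list(weight[0])
    return padded
```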

3.2. ref_adaln_proj: global AdaLN appearance anchor

Two-layer MLP that pools the reference latent into a vector added to every block's AdaLN timestep bias.

| Key | ref_adaln_proj-role_embedding shape | ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj shape |
| --- | --- | --- |
| ref_adaln_proj.fc1.weight | (512, 256) | (512, 768) |
| ref_adaln_proj.fc1.bias | (512,) | (512,) |
| ref_adaln_proj.proj.weight | (36864, 512) | (36864, 512) |
| ref_adaln_proj.proj.bias | (36864,) | (36864,) |

⚠️ Shape mismatch on fc1.weight. The ref_adaln_proj-role_embedding build was trained with a 2-scale pool (avg_1x1 ‖ max_1x1 → 256-dim input). The ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj build was trained with a 3-scale pool (avg_1x1 ‖ avg_2x2 ‖ max_1x1 → 768-dim input). Because the two input widths are incompatible, the trainer reinitializes ref_adaln_proj from scratch when continuing from ref_adaln_proj-role_embedding; the AdaLN projector in the continuation is not a fine-tune of the original one. The output dimension, 36864, is the AdaLN parameter count for the LTX-2 transformer (read at runtime via preprocessor.adaln.linear.out_features).
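The two input widths are consistent with each pooled H×W grid being flattened into H·W·C features with C = 128 channels: 128 + 128 = 256 for the 2-scale pool and 128 + 4·128 + 128 = 768 for the 3-scale pool. The sketch below encodes that arithmetic. The channel count, the flattening rule, and the (out_features, in_features) weight layout are assumptions inferred from the shapes in this document, and the function names are illustrative.

```python
def pooled_feature_dim(pool_specs, channels=128):
    """Sum the flattened sizes of pooled grids such as "avg_2x2" (2*2*channels)."""
    total = 0
    for spec in pool_specs:
        _kind, grid = spec.split("_")  # "avg_2x2" -> ("avg", "2x2")
        height, width = (int(n) for n in grid.split("x"))
        total += height * width * channels
    return total


def fc1_is_compatible(fc1_weight_shape, pool_specs, channels=128):
    """Check that fc1.weight, assumed (out_features, in_features), fits the pool."""
    _out_features, in_features = fc1_weight_shape
    return in_features == pooled_feature_dim(pool_specs, channels)
```

Under these assumptions, the original (512, 256) weight does not fit the 3-scale pool, which is why the projector cannot be carried over into the continuation.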

3.3. ref_visual_proj: visual cross-attention memory tokens

Present in ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj only. Implemented by SafeVisualRefProjector (training file video_to_video_ref_visual.py); it produces 32 visual memory tokens consumed by the new ref_attn branch.

| Key | Shape | Notes |
| --- | --- | --- |
| ref_visual_proj.fc1.weight | (1024, 384) | input 384 = 128 (local pooled) + 128 (global mean) + 128 (global std) |
| ref_visual_proj.fc1.bias | (1024,) | xavier init gain 0.1 |
| ref_visual_proj.proj.weight | (4096, 1024) | maps to context_dim 4096; xavier init gain 0.05 |
| ref_visual_proj.proj.bias | (4096,) | |
| ref_visual_proj.norm.weight | (4096,) | LayerNorm γ |
| ref_visual_proj.norm.bias | (4096,) | LayerNorm β |
| ref_visual_proj.pos_embed | (1, 32, 4096) | per-token learned positional bias |

Forward (matches SafeVisualRefProjector.forward):

tokens = local ‖ global_mean ‖ global_std          # [B, 32, 384]
tokens = proj(silu(fc1(tokens)))                   # → [B, 32, 4096]
tokens = LayerNorm(tokens)
tokens = tokens + pos_embed[:, :32]
return tokens * token_scale                        # training default 0.25

Not present in ref_adaln_proj-role_embedding; this entire branch is new.


4. Total tensor counts (sanity check)

ref_adaln_proj-role_embedding

LoRA: 10 modules × 48 blocks × 2 (A,B)            = 960
ref_adaln_proj: 4 (fc1.{w,b}, proj.{w,b})         =   4
role_embedding: 1                                 =   1
                                              total= 965 ✓

ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj

LoRA: 14 modules × 48 blocks × 2 (A,B)            = 1344
ref_adaln_proj: 4                                 =    4
ref_visual_proj: 7                                =    7
role_embedding: 1                                 =    1
                                              total= 1356 ✓
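The same arithmetic can be expressed as a short script, which is convenient for comparing against the key count of an actual file. The module lists and per-extra tensor counts are taken from the tables in this document; the function name is illustrative.

```python
NUM_BLOCKS = 48

BASE_MODULES = [
    "attn1.to_q", "attn1.to_k", "attn1.to_v", "attn1.to_out.0",
    "attn2.to_q", "attn2.to_k", "attn2.to_v", "attn2.to_out.0",
    "ff.net.0.proj", "ff.net.2",
]
REF_ATTN_MODULES = [
    "ref_attn.to_q", "ref_attn.to_k", "ref_attn.to_v", "ref_attn.to_out.0",
]

# Number of tensors each non-LoRA module contributes (Section 3).
EXTRA_TENSOR_COUNTS = {"ref_adaln_proj": 4, "role_embedding": 1, "ref_visual_proj": 7}


def expected_counts(lora_modules, extras):
    """Return (lora_tensors, extra_tensors, total) for a build."""
    lora = len(lora_modules) * NUM_BLOCKS * 2  # one A and one B per module per block
    extra = sum(EXTRA_TENSOR_COUNTS[name] for name in extras)
    return lora, extra, lora + extra


original = expected_counts(BASE_MODULES, ["ref_adaln_proj", "role_embedding"])
continuation = expected_counts(
    BASE_MODULES + REF_ATTN_MODULES,
    ["ref_adaln_proj", "role_embedding", "ref_visual_proj"],
)
print(original)      # (960, 5, 965)
print(continuation)  # (1344, 12, 1356)
```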

5. Loading checkpoints at inference

Use scripts/split_editanything_lora.py to split each raw training checkpoint into:

  • *.standard.safetensors: LoRA on attn1/attn2/ff only; safe to feed to ComfyUI's standard LoraLoader.
  • *.module.safetensors: everything else (role_embedding, ref_adaln_proj, ref_visual_proj, ref_attn LoRA adapters); feed to LTXVEditAnythingModuleLoader.

The filename suffix lists every extra that ended up in the module sidecar, so it is obvious at a glance which mechanisms a given pair carries. Order is fixed: ref_adaln_proj, role_embedding, ref_attn, ref_visual_proj.
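The routing rule and the suffix convention described above can be sketched as follows. This is not the code of scripts/split_editanything_lora.py; it is an illustration under the assumptions that routing depends only on the key name and that an extra counts as present when its name appears as a dot-separated component of a sidecar key. All function names are hypothetical.

```python
# Fixed suffix order, as stated in this document.
EXTRA_ORDER = ["ref_adaln_proj", "role_embedding", "ref_attn", "ref_visual_proj"]
STANDARD_PREFIXES = ("attn1.", "attn2.", "ff.")


def route_key(key):
    """Return "standard" for attn1/attn2/ff LoRA tensors, "module" otherwise."""
    marker = "transformer_blocks."
    if marker in key and ".lora_" in key:
        # Strip "...transformer_blocks.<idx>." to get e.g. "attn1.to_q.lora_A.weight".
        tail = key.split(marker, 1)[1].split(".", 1)[1]
        if tail.startswith(STANDARD_PREFIXES):
            return "standard"
    return "module"


def extras_suffix(keys):
    """Join the extras found among sidecar keys, in the fixed order."""
    sidecar = [k for k in keys if route_key(k) == "module"]
    present = [
        name for name in EXTRA_ORDER
        if any(name in k.split(".") for k in sidecar)
    ]
    return "-".join(present)


def output_names(basename, keys):
    """Build the (standard, module) filenames for a checkpoint's key list."""
    stem = "%s_%s" % (basename, extras_suffix(keys))
    return stem + ".standard.safetensors", stem + ".module.safetensors"
```

Given a key list containing ref_adaln_proj.* and role_embedding.* tensors and the basename edit_anything_reference_v0.1_r128, output_names yields the first pair of canonical names listed in this section.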

Canonical output names

edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors

edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors

Command

python3 /data/training/ltx-edit-trainer/scripts/split_editanything_lora.py \
  <raw-checkpoint>.safetensors --output-dir <dir> [--overwrite]