Functional differences between the two builds and what each layer does

Companion to lora_layers_reference.md. That file is the inventory; this one explains the functional role of every group of tensors and the expected behavioral impact of toggling each branch at inference.

Two builds of the edit_anything_reference_v0.1_r128 LoRA exist, each delivered as a (.standard, .module) pair. The pairs are distinguished by their extras suffix:

..._ref_adaln_proj-role_embedding.{standard,module}.safetensors — the original build. One mechanism for steering the model toward the reference image: global AdaLN appearance anchoring (plus the IC-LoRA-style ref tokens packed into the sequence).
..._ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.{standard,module}.safetensors — the continuation. Keeps everything from the original build and adds two new mechanisms that operate on different time/space scales.

In the rest of this doc the two builds are referred to by their suffix only:

ref_adaln_proj-role_embedding
ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj

TL;DR

| Branch | Where it acts | What it controls | New in `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`? |
| --- | --- | --- | --- |
| `attn1` LoRA | self-attention inside every block | scene cohesion, structural editing | no (carried over, frozen) |
| `attn2` LoRA | cross-attention to text (Gemma) | prompt following | no (retrained) |
| `ff` LoRA | feed-forward MLP | feature mixing / capacity | no (retrained) |
| `ref_attn` LoRA | dedicated cross-attention to 32 visual memory tokens | preserving fine-grained appearance of the reference | yes |
| `ref_visual_proj` | projects the ref VAE latent into 32 context tokens | the content that `ref_attn` attends to | yes |
| `ref_adaln_proj` | produces a global vector added to the timestep AdaLN | overall color/style/identity bias | no (retrained from scratch, new pooling) |
| `role_embedding` | adds a 128-dim bias to ref tokens in the IC-LoRA sequence | tells the transformer "this token is the reference" | no (frozen in the continuation) |

So:

ref_adaln_proj-role_embedding only had a slow, global appearance signal (AdaLN) plus the IC-LoRA-style ref tokens packed into the sequence.
ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj adds a fast, local appearance signal (visual cross-attention) that injects the reference's actual textures into every block in the 12 → 35 range.

1. The 10 modules shared between both builds

These cover the full 48-block transformer and were retrained in ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj (except attn1.*, which is loaded but frozen — see the training freeze policy in the inventory).
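As a quick sanity check on the count, the brace patterns used in the headings of this section can be expanded into the ten per-block module names. This is only a sketch: the names are the shorthand used in this doc, and the real state-dict keys may carry extra prefixes or LoRA suffixes that are not modeled here.

```python
# Shorthand module patterns exactly as written in this document.
# Assumption: real safetensors keys may add prefixes/suffixes around these.
SHARED_PATTERNS = {
    "attn1": ["q", "k", "v", "out.0"],
    "attn2": ["q", "k", "v", "out.0"],
    "ff.net": ["0.proj", "2"],
}

# Per the text above, attn1.* is loaded but frozen in the continuation build.
FROZEN_IN_CONTINUATION = {"attn1"}


def shared_modules():
    """Expand the brace patterns into flat module names."""
    return [
        f"{group}.{leaf}"
        for group, leaves in SHARED_PATTERNS.items()
        for leaf in leaves
    ]


def is_trainable_in_continuation(name):
    """True if the module was retrained in the continuation build."""
    group = name.split(".")[0]  # "ff.net.2" -> "ff", "attn1.q" -> "attn1"
    return group not in FROZEN_IN_CONTINUATION


modules = shared_modules()
frozen = [m for m in modules if not is_trainable_in_continuation(m)]
print(len(modules), "shared modules per block;", len(frozen), "frozen")
```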

`attn1.{q,k,v,out.0}` — self-attention

Every transformer block first does self-attention over the latent video tokens. The LoRA here adjusts how tokens relate to each other:

structural consistency of the generated frames,
how strongly the @reference IC-LoRA token influences neighboring spatial positions,
low-level look (sharpness, contrast).

In the ..._ref_attn-ref_visual_proj build these are frozen on purpose so the original priors over motion and structure stay intact. If the inference output looks structurally broken (jitter, motion drift, layout collapse), you probably misloaded these adapters or the standard LoRA is at the wrong strength.

`attn2.{q,k,v,out.0}` — cross-attention to text

This is the prompt-following path. The Gemma text embedding is the K/V; the video latent is the Q. The LoRA tunes how the prompt drives the edit.

Stronger attn2 deltas ⇒ the model leans more on the prompt ("Add @reference sleeping on the armrest"). Useful for compositional control.
If you disable or weaken the standard LoRA (e.g. strength_model=0), the base model goes back to ignoring your edit instructions — even if ref_attn is still active, the prompt-binding is gone.

`ff.net.{0.proj, 2}` — MLP capacity

The block's feed-forward part. The LoRA here adds representational capacity to absorb the new behaviors that prompt + reference impose. There is no single user-visible "knob" for this; it works behind the scenes.

If you slash its strength you'll see colors and textures drift back toward generic LTX-2 outputs.

2. The new `ref_attn` branch

This is the heart of the change in ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj. Each of the 48 transformer blocks now carries an additional attention module, ref_attn, alongside attn1 (self) and attn2 (text). ref_attn cross-attends from the noisy video latent (Q) to a small set of visual memory tokens computed from the reference image (K/V).

Why four projections (q/k/v/out.0)

ref_attn is a standard cross-attention module with the usual query, key, value, and output projections. The base weights are copied from attn2 at load time (init_ref_attn_from: attn2) so the module starts as "text cross-attn, but pointed at visual tokens"; the LoRA then teaches it to actually use those visual tokens.

Per-block gating

ref_attn is only consulted in blocks 12 → 35 (this is what ref_start_block / ref_end_block enforce at inference and what the trainer used during fine-tuning). Skipping blocks 0–11 keeps the early low-level features untouched; skipping blocks 36–47 lets the late decoding stages do their job without extra visual bias.
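The gating rule can be written as a one-line predicate. This is a minimal sketch, assuming the range is inclusive on both ends (which matches "skipping blocks 0–11" and "skipping blocks 36–47"); the function name is made up for illustration and is not the node's actual API.

```python
NUM_BLOCKS = 48  # full transformer depth stated in this document


def ref_attn_enabled(block_idx, ref_start_block=12, ref_end_block=35):
    """Hypothetical helper: is ref_attn consulted in this block?

    Assumes an inclusive [ref_start_block, ref_end_block] range.
    """
    return ref_start_block <= block_idx <= ref_end_block


active = [i for i in range(NUM_BLOCKS) if ref_attn_enabled(i)]
print(f"ref_attn runs in {len(active)} of {NUM_BLOCKS} blocks "
      f"({active[0]}..{active[-1]})")
```

With the training defaults this leaves 24 gated blocks, with 12 blocks skipped on each side.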

Impact

Strong identity preservation for things the AdaLN anchor can't capture (small logos, eye color, fur texture, asymmetric details).
Scaled by ref_context_scale (training default 0.01). Small for a reason: the visual tokens are dense, and the residual is added on top of every block in the 12–35 range — even at 0.01 the cumulative effect is meaningful.
Doubling the scale (→ 0.02) usually intensifies identity at the cost of motion fidelity; going to 0.05+ tends to "freeze" parts of the scene to the reference appearance.
Setting ref_start_block=0 is destructive: blocks 0–11 never saw ref_context during training, so injecting it there feeds the model noise — outputs collapse to black or random patterns.
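To get a feel for why 0.01 is not as small as it looks, here is some back-of-the-envelope arithmetic. It assumes, purely for illustration, that every gated block contributes a ref_attn output of unit magnitude and that the contributions all point the same way, so the result is a loose upper bound rather than a measurement.

```python
def accumulated_ref_gain(ref_context_scale, ref_start_block=12, ref_end_block=35):
    """Upper bound on the summed ref_attn residual weight over the gated blocks.

    Assumption: each block adds ref_context_scale * (unit-magnitude output)
    and all contributions align. Real activations will be smaller.
    """
    n_blocks = ref_end_block - ref_start_block + 1  # inclusive range -> 24
    return n_blocks * ref_context_scale


for scale in (0.01, 0.02, 0.05):
    gain = accumulated_ref_gain(scale)
    print(f"ref_context_scale={scale}: up to {gain:.2f}x a unit residual")
```

Under those assumptions the three settings come out at roughly 0.24, 0.48, and 1.20, which is consistent with 0.05 being the point where the reference starts to dominate.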

3. The new `ref_visual_proj`

This is the source of what ref_attn attends to. Without it the ref_attn LoRA is useless — there are no visual tokens to read.

Forward

ref_frame  = mean over time of the ref VAE latent       # [B, 128, H, W]
local      = adaptive_avg_pool to (4, 8)                 # 32 spatial cells
global_mean, global_std over the whole frame             # 2 × 128
tokens     = concat(local, broadcast(mean,std))          # [B, 32, 384]
tokens     = proj(silu(fc1(tokens)))                     # [B, 32, 4096]
tokens     = LayerNorm(tokens)
tokens     = tokens + pos_embed[:, :32]
return tokens * token_scale                              # 0.25 in training
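The pseudocode above can be turned into a runnable PyTorch sketch for shape checking. The class name, constructor arguments, the `[B, C, T, H, W]` input layout, zero-initialized `pos_embed`, and the default (unbiased) standard deviation are all assumptions; only the tensor shapes and the order of operations come from this document, and weight initialization is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RefVisualProjSketch(nn.Module):
    """Illustrative stand-in for ref_visual_proj (not the shipped module)."""

    def __init__(self, latent_ch=128, hidden=1024, ctx_dim=4096,
                 grid=(4, 8), token_scale=0.25):
        super().__init__()
        self.grid = grid
        self.token_scale = token_scale
        n_tokens = grid[0] * grid[1]                  # 4 * 8 = 32 memory tokens
        self.fc1 = nn.Linear(3 * latent_ch, hidden)   # 384 -> 1024
        self.proj = nn.Linear(hidden, ctx_dim)        # 1024 -> 4096
        self.norm = nn.LayerNorm(ctx_dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_tokens, ctx_dim))

    def forward(self, ref_latent):
        # ref_latent: [B, C, T, H, W] (layout assumed)
        ref_frame = ref_latent.mean(dim=2)                        # [B, C, H, W]
        local = F.adaptive_avg_pool2d(ref_frame, self.grid)       # [B, C, 4, 8]
        local = local.flatten(2).transpose(1, 2)                  # [B, 32, C]
        g_mean = ref_frame.mean(dim=(2, 3))                       # [B, C]
        g_std = ref_frame.std(dim=(2, 3))                         # [B, C]
        glob = torch.cat([g_mean, g_std], dim=-1)                 # [B, 2C]
        glob = glob.unsqueeze(1).expand(-1, local.shape[1], -1)   # [B, 32, 2C]
        tokens = torch.cat([local, glob], dim=-1)                 # [B, 32, 3C]
        tokens = self.proj(F.silu(self.fc1(tokens)))              # [B, 32, ctx]
        tokens = self.norm(tokens)
        tokens = tokens + self.pos_embed[:, : tokens.shape[1]]
        return tokens * self.token_scale


if __name__ == "__main__":
    module = RefVisualProjSketch()
    out = module(torch.randn(1, 128, 1, 16, 32))
    print(tuple(out.shape))  # 32 tokens of width 4096 per sample
```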

Layer-by-layer impact

| Tensor | What it controls | Notes (effect if perturbed) |
| --- | --- | --- |
| `fc1.weight / bias` (1024×384) | maps the 384-dim raw appearance descriptor into the projector's hidden space | weights here decide which aspects of the pooled appearance survive (e.g. color vs. texture vs. luminance) |
| `proj.weight / bias` (4096×1024) | lifts the hidden vector into the transformer context dim | initialized with small gain (0.05) so the branch starts as a near no-op; final values are loaded from training |
| `norm.weight / bias` (4096) | LayerNorm on the projected tokens | keeps numerical range consistent across reference images so `ref_attn` works at the same scale regardless of input statistics |
| `pos_embed` (1, 32, 4096) | per-position bias for the 32 memory tokens | lets the model distinguish "top-left cell" from "bottom-right cell"; without it, all 32 tokens would be permutation-invariant and `ref_attn` would degenerate |

`ref_token_scale` (training = 0.25)

This is the runtime multiplier on the projector output. It is not a stored tensor but a knob in the inference node. Doubling it (→ 0.5) roughly doubles the magnitude of the tokens that ref_attn projects into K and V. Larger keys scale the attention logits, which sharpens the softmax over the 32 memory tokens, and larger values enlarge what is read back. Combined with ref_context_scale, you have two independent ways to over- or under-amplify the visual reference branch.
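A toy calculation shows what scaling the memory tokens does to the attention weights. It assumes bias-free key projections, so that scaling the tokens scales the keys by the same factor; the vectors are made-up numbers, not values from the model.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, keys, token_scale):
    """Scaled dot-product attention weights for one query.

    Assumption: keys are a bias-free linear function of the tokens, so
    multiplying tokens by token_scale multiplies every logit by it too.
    """
    d = len(q)
    logits = [
        token_scale * sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
        for k in keys
    ]
    return softmax(logits)


q = [1.0, 0.0]                               # toy query
keys = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # three toy memory tokens

w_train = attention_weights(q, keys, 0.25)   # training value
w_double = attention_weights(q, keys, 0.5)   # doubled
print("peak weight at 0.25:", round(max(w_train), 3))
print("peak weight at 0.50:", round(max(w_double), 3))
```

The peak weight grows as the scale goes up, so attention concentrates on fewer memory tokens, while the values being mixed also grow in magnitude.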

4. `ref_adaln_proj` — retrained, not continued

Both builds have this projector, but the input dimension changed:

| | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` |
| --- | --- | --- |
| Pooling | `avg_1x1 ‖ max_1x1` (2-scale) | `avg_1x1 ‖ avg_2x2 ‖ max_1x1` (3-scale) |
| `fc1.weight` shape | (512, 256) | (512, 768) |

Because of the shape mismatch the trainer reinitializes ref_adaln_proj from scratch when continuing from ref_adaln_proj-role_embedding. The ref_adaln_proj in the continuation is not a fine-tune of the original — it learned fresh. wandb confirms this: ref_proj/weight_norm ramps from near-zero to ~2.9.
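The two input widths follow directly from the pooling layouts if the latent has 128 channels, as the ref VAE latent shape elsewhere in this doc suggests. The sketch below reproduces that arithmetic and the shape comparison that forces a reinitialization; the function names are illustrative only, not the trainer's API.

```python
LATENT_CHANNELS = 128  # assumption: same channel count as the ref VAE latent


def pooled_dim(pool_grids, channels=LATENT_CHANNELS):
    """Width of the concatenated descriptor for a list of (h, w) pooling grids."""
    return sum(channels * h * w for h, w in pool_grids)


# avg_1x1 + max_1x1
original_in = pooled_dim([(1, 1), (1, 1)])
# avg_1x1 + avg_2x2 + max_1x1
continuation_in = pooled_dim([(1, 1), (2, 2), (1, 1)])


def can_resume(saved_shape, new_shape):
    """A saved tensor can only be loaded into a parameter of identical shape."""
    return tuple(saved_shape) == tuple(new_shape)


print("original fc1 input width:", original_in)
print("continuation fc1 input width:", continuation_in)
print("weights reusable:", can_resume((512, original_in), (512, continuation_in)))
```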

What it actually does

Builds one per-sample vector that is added to the timestep bias fed into every transformer block's AdaLN layer. The result: a persistent, sample-wide "lean toward this reference" applied throughout denoising.

Why this is the complement of `ref_attn`

ref_attn is localized: visual tokens cross-attend per spatial cell, letting the model copy fine details.
ref_adaln_proj is global: a single conditioning vector tints all 48 blocks uniformly. Best for "the overall look of the output should remind me of this reference" (palette, lighting, broad style).

`adaln_scale` (training = 2.0)

The user-side multiplier. At training default 2.0, AdaLN is doing a lot of the appearance lifting. Common failure modes:

adaln_scale=0: model ignores the reference's global look; you keep only what ref_attn and the IC-LoRA tokens can recover. Expect washed-out identity.
adaln_scale=1.0 (ComfyUI default before the recent realignment): exactly half the training-time strength. Identity is still recognizable but visibly weaker.
adaln_scale>3: identity dominates and the model starts ignoring the prompt / guide motion.
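A toy version of the conditioning arithmetic makes the "exactly half" statement concrete. It assumes the reference vector is simply multiplied by adaln_scale and added to the timestep conditioning, which is how this section describes it; the vectors are made-up numbers and the function names are hypothetical.

```python
import math


def condition_vector(timestep_emb, ref_vec, adaln_scale=2.0):
    """Timestep conditioning plus the scaled reference vector (toy version)."""
    return [t + adaln_scale * r for t, r in zip(timestep_emb, ref_vec)]


def ref_offset_norm(ref_vec, adaln_scale):
    """L2 norm of the offset that the reference contributes."""
    return abs(adaln_scale) * math.sqrt(sum(r * r for r in ref_vec))


timestep_emb = [0.5, -1.0, 0.25]  # toy values
ref_vec = [0.3, 0.4, 0.0]         # toy values, L2 norm 0.5

for scale in (0.0, 1.0, 2.0, 3.0):
    print(f"adaln_scale={scale}: offset norm = {ref_offset_norm(ref_vec, scale):.2f}")
```

At `adaln_scale=0` the conditioning is the plain timestep vector, and at 1.0 the offset is half of what the adapters saw during training.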

5. `role_embedding` — present in both, behavior depends on which you load

A learned [1, 128] vector that adds a fingerprint to the patchified tokens belonging to the IC-LoRA reference image, so the transformer can tell the ref token apart from generic guide / target tokens.

In `ref_adaln_proj-role_embedding`

Was trained with use_visual_ref_role_embedding=True — that's where the non-zero value (~0.125 norm) comes from. The attn1/attn2 adapters in this build therefore learned to recognize this bias.

In `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`

Inherits the value from ref_adaln_proj-role_embedding but trains with use_visual_ref_role_embedding=False, meaning the bias is never added during training. The vector is frozen at its inherited value; wandb shows its norm flat at 0.125 across the whole run.

Inference rule

When loading ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj: keep enable_role_embedding=False. Turning it on adds a bias to the ref tokens that this build never saw — the attn1/attn2 adapters retrained without it, so the bias becomes adversarial noise and degrades the output.

When loading ref_adaln_proj-role_embedding directly (no ..._ref_attn-ref_visual_proj adapters), the opposite is true: enable_role_embedding=True matches the training distribution.
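The rule can be encoded as a small helper keyed on the build suffix. This is a sketch with a hypothetical function name; it only restates the rule above and does not inspect any file.

```python
def recommended_enable_role_embedding(build_suffix):
    """Return the enable_role_embedding value that matches training.

    Hypothetical helper: builds that carry ref_attn were trained with the
    role bias off, the original build with it on.
    """
    extras = build_suffix.split("-")
    return "ref_attn" not in extras


ORIGINAL = "ref_adaln_proj-role_embedding"
CONTINUATION = "ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj"

print(ORIGINAL, "->", recommended_enable_role_embedding(ORIGINAL))
print(CONTINUATION, "->", recommended_enable_role_embedding(CONTINUATION))
```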

6. Quick reference: what each knob does at inference

| Knob | `..._ref_attn-ref_visual_proj` training value | Effect of raising it | Effect of lowering it |
| --- | --- | --- | --- |
| `adaln_scale` | 2.0 | stronger global look | identity fades |
| `ref_context_scale` | 0.01 | sharper fine-grained ID; can over-freeze | local detail blurs back to base |
| `ref_token_scale` | 0.25 | more "voice" for the visual tokens in attention | `ref_attn` fades toward a no-op as the value approaches 0 |
| `ref_start_block` / `ref_end_block` | 12 / 35 | (do not change) | (do not change); outside this range the LoRA is untrained |
| `enable_role_embedding` | False | adds out-of-distribution bias to ref tokens | matches training |
| `role_strength` | n/a | only matters if `enable_role_embedding=True` | only matters if `enable_role_embedding=True` |
| Standard LoRA `strength_model` | 1.0 | over-fits to training distribution | drifts back toward base LTX-2 |

The combination that mirrors training of the ..._ref_attn-ref_visual_proj build exactly: adaln_scale=2.0, ref_context_scale=0.01, ref_token_scale=0.25, ref_start_block=12, ref_end_block=35, enable_role_embedding=False, ref_init_from=attn2, strength_model=1.0.
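One way to guard against silent drift from these values is to keep them in a dict and diff a workflow's settings against it. This is a sketch: the keys mirror the knob names listed above, but the helper is hypothetical and how the values are read out of an actual ComfyUI workflow is left out.

```python
# Training-time values for the ..._ref_attn-ref_visual_proj build, as listed above.
TRAINING_SETTINGS = {
    "adaln_scale": 2.0,
    "ref_context_scale": 0.01,
    "ref_token_scale": 0.25,
    "ref_start_block": 12,
    "ref_end_block": 35,
    "enable_role_embedding": False,
    "ref_init_from": "attn2",
    "strength_model": 1.0,
}


def diff_from_training(settings):
    """Return {knob: (actual, expected)} for every knob that deviates.

    Knobs missing from `settings` are reported with an actual value of None.
    """
    return {
        key: (settings.get(key), expected)
        for key, expected in TRAINING_SETTINGS.items()
        if settings.get(key) != expected
    }


# Example: a workflow still on the older adaln_scale default.
workflow = dict(TRAINING_SETTINGS, adaln_scale=1.0)
for knob, (actual, expected) in diff_from_training(workflow).items():
    print(f"{knob}: got {actual}, training used {expected}")
```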

7. Where the loaded files come from

scripts/split_editanything_lora.py produces two safetensors per checkpoint. The filename suffix lists every extra that ended up in the module sidecar (fixed order: ref_adaln_proj, role_embedding, ref_attn, ref_visual_proj), so you can tell which mechanisms each pair carries without opening the file.

Canonical pairs:

edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors

edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors

Feed the .standard.* into ComfyUI's standard LoRA loader and the .module.* into LTXVEditAnythingModuleLoader. Mixing pairs across builds (e.g., ref_adaln_proj-role_embedding.standard.* with ..._ref_attn-ref_visual_proj.module.*) is not supported — the LoRA deltas were trained against the partner adapters in the same build.
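Because the suffix encodes the build, a mismatched pair can be caught from the filenames alone. The sketch below assumes the naming scheme shown in the canonical pairs above; the helper names are made up, and nothing here opens the safetensors files.

```python
import re

PREFIX = "edit_anything_reference_v0.1_r128_"
KNOWN_EXTRAS = ["ref_adaln_proj", "role_embedding", "ref_attn", "ref_visual_proj"]

_NAME_RE = re.compile(
    re.escape(PREFIX) + r"(?P<suffix>.+)\.(?P<kind>standard|module)\.safetensors"
)


def parse_name(filename):
    """Split a filename into (extras suffix, 'standard' | 'module')."""
    match = _NAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"unrecognized filename: {filename}")
    extras = match["suffix"].split("-")
    unknown = [e for e in extras if e not in KNOWN_EXTRAS]
    if unknown:
        raise ValueError(f"unknown extras in suffix: {unknown}")
    return match["suffix"], match["kind"]


def is_matching_pair(standard_file, module_file):
    """True only if both files belong to the same build and fill both roles."""
    s_suffix, s_kind = parse_name(standard_file)
    m_suffix, m_kind = parse_name(module_file)
    return s_kind == "standard" and m_kind == "module" and s_suffix == m_suffix


std_v1 = PREFIX + "ref_adaln_proj-role_embedding.standard.safetensors"
mod_v2 = PREFIX + "ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors"
print("mixed pair ok?", is_matching_pair(std_v1, mod_v2))
```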
