	# Functional differences between the two builds and what each layer does

	Companion to `lora_layers_reference.md`. That file is the inventory; this one
	explains the functional role of every group of tensors and the **expected
	behavioral impact** of toggling each branch at inference.

	Two builds of the `edit_anything_reference_v0.1_r128` LoRA exist, each
	delivered as a `(.standard, .module)` pair. The pairs are distinguished by
	their extras suffix:

	- `..._ref_adaln_proj-role_embedding.{standard,module}.safetensors` — the
	original build. One mechanism for steering the model toward the reference
	image: global AdaLN appearance anchoring (plus the IC-LoRA-style ref
	tokens packed into the sequence).
	- `..._ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.{standard,module}.safetensors`
	— the continuation. Keeps everything from the original build and adds
	two new mechanisms that operate on different time/space scales.

	In the rest of this doc the two builds are referred to by their suffix only:
	- `ref_adaln_proj-role_embedding`
	- `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`

	---

	## TL;DR

	| Branch | Where it acts | What it controls | New in `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`? |
	|---|---|---|---|
	| `attn1` LoRA | self-attention inside every block | scene cohesion, structural editing | no (carried over, frozen) |
	| `attn2` LoRA | cross-attention to text (Gemma) | prompt following | no (re-trained) |
	| `ff` LoRA | feed-forward MLP | feature mixing / capacity | no (re-trained) |
	| `ref_attn` LoRA | dedicated cross-attention to 32 visual memory tokens | preserving fine-grained appearance of the reference | yes |
	| `ref_visual_proj` | projects the ref VAE latent into 32 context tokens | the content that `ref_attn` attends to | yes |
	| `ref_adaln_proj` | produces a global vector added to the timestep AdaLN | overall color/style/identity bias | retrained (new pooling) |
	| `role_embedding` | adds a 128-dim bias to ref tokens in the IC-LoRA sequence | tells the transformer "this token is the reference" | frozen in the continuation |

	So:
	- `ref_adaln_proj-role_embedding` only had a slow, global appearance signal
	(AdaLN) plus the IC-LoRA-style ref tokens packed into the sequence.
	- `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` adds a **fast,
	local** appearance signal (visual cross-attention) that injects the
	reference's actual textures into every block in the 12 → 35 range.

	---

	## 1. The 10 modules shared between both builds

	These cover the full 48-block transformer and were retrained in
	`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` (except `attn1.*`,
	which is loaded but frozen — see the training freeze policy in the inventory).

	### `attn1.{q,k,v,out.0}` — self-attention

	Every transformer block first does self-attention over the latent video
	tokens. The LoRA here adjusts how tokens relate to each other:
	- structural consistency of the generated frames,
	- **how strongly the `@reference` IC-LoRA token influences neighboring
	spatial positions**,
	- low-level look (sharpness, contrast).

	In the `..._ref_attn-ref_visual_proj` build these are frozen on purpose so
	the original priors over motion and structure stay intact. If the inference
	output looks structurally broken (jitter, motion drift, layout collapse),
	you probably misloaded these adapters or the standard LoRA is at the wrong
	strength.

	### `attn2.{q,k,v,out.0}` — cross-attention to text

	This is the prompt-following path. The Gemma text embedding is the K/V; the
	video latent is the Q. The LoRA tunes how the prompt drives the edit.

	- Stronger `attn2` deltas ⇒ the model leans more on the prompt ("Add
	@reference sleeping on the armrest"). Useful for compositional control.
	- If you disable or weaken the standard LoRA (e.g. `strength_model=0`), the
	base model goes back to ignoring your edit instructions — even if `ref_attn`
	is still active, the prompt-binding is gone.

	### `ff.net.{0.proj, 2}` — MLP capacity

	The block's feed-forward part. The LoRA here adds **representational
	capacity** to absorb the new behaviors that prompt + reference impose. There
	is no single user-visible "knob" for this; it works behind the scenes.

	If you slash its strength you'll see colors and textures drift back toward
	generic LTX-2 outputs.

	---

	## 2. The new `ref_attn` branch

	This is the heart of the change in
	`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`. Each of the 48
	transformer blocks now has an additional attention module, `ref_attn`,
	alongside `attn1` (self) and `attn2` (text). `ref_attn` cross-attends from
	the noisy video latent (Q) to **a small set of visual memory tokens
	computed from the reference image** (K/V).

	### Why four projections (q/k/v/out.0)

	`ref_attn` is a standard cross-attention, so it carries the usual query, key,
	value, and output projections. The base weights are copied from `attn2` at
	load time (`init_ref_attn_from: attn2`), so the module starts as "text
	cross-attn, but pointed at visual tokens"; the LoRA then teaches it to
	actually use those visual tokens.
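
The copy-from-`attn2` initialization can be pictured as a key-renaming pass over the base state dict. The sketch below uses plain Python dicts. The key layout (`transformer_blocks.<i>.attn2.<param>`) and the helper name `init_ref_attn_from_attn2` are assumptions made for illustration, not the loader's actual code.

```python
def init_ref_attn_from_attn2(state_dict, num_blocks=48):
    """Return a copy of state_dict with ref_attn.* entries seeded from attn2.*.

    Assumes (hypothetically) keys shaped like
    'transformer_blocks.<i>.attn2.<param>'.
    """
    out = dict(state_dict)
    for key, value in state_dict.items():
        parts = key.split(".")
        is_attn2 = (
            len(parts) > 3
            and parts[0] == "transformer_blocks"
            and parts[1].isdigit()
            and int(parts[1]) < num_blocks
            and parts[2] == "attn2"
        )
        if is_attn2:
            parts[2] = "ref_attn"
            # Keep any ref_attn weight the checkpoint already provides.
            # With real tensors you would store value.clone() instead.
            out.setdefault(".".join(parts), value)
    return out


base = {
    "transformer_blocks.0.attn1.to_q.weight": [3.0],
    "transformer_blocks.0.attn2.to_q.weight": [1.0, 2.0],
}
seeded = init_ref_attn_from_attn2(base)
print(sorted(seeded))
```

Only the base weights are seeded this way; the `ref_attn` LoRA deltas still come from the module sidecar.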

	### Per-block gating

	`ref_attn` is only consulted in blocks 12 → 35 (this is what
	`ref_start_block` / `ref_end_block` enforce at inference and what the trainer
	used during fine-tuning). Skipping blocks 0–11 keeps the early low-level
	features untouched; skipping blocks 36–47 lets the late decoding stages do
	their job without extra visual bias.
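
As a minimal sketch of the gating rule, the check below treats both bounds as inclusive, which matches the "skip 0-11, skip 36-47" description. The function name `ref_attn_active` is hypothetical.

```python
NUM_BLOCKS = 48


def ref_attn_active(block_idx, ref_start_block=12, ref_end_block=35):
    """True if ref_attn should run in this block (both bounds inclusive)."""
    return ref_start_block <= block_idx <= ref_end_block


active = [i for i in range(NUM_BLOCKS) if ref_attn_active(i)]
print(len(active), active[0], active[-1])  # 24 12 35
```

With the training range, exactly half of the 48 blocks receive the visual residual.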

	### Impact

	- Strong identity preservation for things the AdaLN anchor can't capture
	(small logos, eye color, fur texture, asymmetric details).
	- Scaled by `ref_context_scale` (training default `0.01`). Small for a
	reason: the visual tokens are dense, and the residual is added on top of
	every block in the 12–35 range — even at 0.01 the cumulative effect is
	meaningful.
	- Doubling the scale (→ 0.02) usually intensifies identity at the cost of
	motion fidelity; going to 0.05+ tends to "freeze" parts of the scene to the
	reference appearance.
	- Setting `ref_start_block=0` is destructive: blocks 0–11 never saw
	`ref_context` during training, so injecting it there feeds the model
	noise — outputs collapse to black or random patterns.

	---

	## 3. The new `ref_visual_proj`

	This is the source of what `ref_attn` attends to. Without it the
	`ref_attn` LoRA is useless — there are no visual tokens to read.

	### Forward

	```
	ref_frame = mean over time of the ref VAE latent # [B, 128, H, W]
	local = adaptive_avg_pool to (4, 8) # 32 spatial cells
	global_mean, global_std over the whole frame # 2 × 128
	tokens = concat(local, broadcast(mean,std)) # [B, 32, 384]
	tokens = proj(silu(fc1(tokens))) # [B, 32, 4096]
	tokens = LayerNorm(tokens)
	tokens = tokens + pos_embed[:, :32]
	return tokens * token_scale # 0.25 in training
	```
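
To make the 384-dim descriptor concrete, here is a dependency-free sketch of the first three steps only (pooling to 4 x 8, per-channel global mean/std, concatenation). It works on nested lists rather than tensors. The helper names are hypothetical, and the use of population standard deviation is an assumption, since the pseudocode does not say which estimator the trainer uses.

```python
import math
from statistics import fmean, pstdev


def adaptive_avg_pool2d(plane, out_h, out_w):
    """Average-pool one H x W plane (nested lists) down to out_h x out_w.

    Bin edges follow the usual adaptive-pooling rule:
    start = floor(i * H / out), end = ceil((i + 1) * H / out).
    """
    h, w = len(plane), len(plane[0])
    pooled = []
    for i in range(out_h):
        r0, r1 = (i * h) // out_h, math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0, c1 = (j * w) // out_w, math.ceil((j + 1) * w / out_w)
            row.append(fmean([plane[r][c] for r in range(r0, r1) for c in range(c0, c1)]))
        pooled.append(row)
    return pooled


def build_descriptor(ref_frame, grid=(4, 8)):
    """ref_frame is [C][H][W], already averaged over time.

    Returns grid_h * grid_w tokens, each [local C | global mean C | global std C].
    """
    channels = len(ref_frame)
    pooled = [adaptive_avg_pool2d(plane, *grid) for plane in ref_frame]
    flat = [[v for row in plane for v in row] for plane in ref_frame]
    g_mean = [fmean(vals) for vals in flat]
    g_std = [pstdev(vals) for vals in flat]  # population std: an assumption
    return [
        [pooled[c][i][j] for c in range(channels)] + g_mean + g_std
        for i in range(grid[0])
        for j in range(grid[1])
    ]


# Toy latent: 128 channels, 8 x 16 spatial, channel c filled with the value c.
frame = [[[float(c)] * 16 for _ in range(8)] for c in range(128)]
tokens = build_descriptor(frame)
print(len(tokens), len(tokens[0]))  # 32 384
```

The global mean/std halves are identical across all 32 tokens; only the first 128 entries vary by cell.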

	### Layer-by-layer impact

	| Tensor | What it controls | If perturbed |
	|---|---|---|
	| `fc1.weight / bias` (1024×384) | maps the 384-dim raw appearance descriptor into the projector's hidden space | weights here decide which aspects of the pooled appearance survive (e.g. color vs. texture vs. luminance) |
	| `proj.weight / bias` (4096×1024) | lifts the hidden vector into the transformer context dim | initialized with small gain (0.05) so the branch starts almost-no-op; loaded from training |
	| `norm.weight / bias` (4096) | LayerNorm on the projected tokens | keeps numerical range consistent across reference images so `ref_attn` works at the same scale regardless of input statistics |
	| `pos_embed` (1, 32, 4096) | per-position bias for the 32 memory tokens | the model uses this to distinguish "top-left cell" from "bottom-right cell"; without it, all 32 tokens would be permutation-invariant and `ref_attn` would degenerate |
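
For a sense of size, the tensor shapes in the table above add up as follows. This is plain arithmetic on the listed shapes; the dictionary layout and function name are illustrative only.

```python
import math

# (out_features, in_features) for the two linear layers, as listed in the table.
LINEAR = {"fc1": (1024, 384), "proj": (4096, 1024)}
NORM_DIM = 4096
POS_EMBED_SHAPE = (1, 32, 4096)


def param_counts():
    """Parameter count per tensor group of ref_visual_proj."""
    counts = {name: out * inp + out for name, (out, inp) in LINEAR.items()}  # weight + bias
    counts["norm"] = 2 * NORM_DIM  # LayerNorm weight + bias
    counts["pos_embed"] = math.prod(POS_EMBED_SHAPE)
    return counts


counts = param_counts()
print(counts)
print(sum(counts.values()))  # 4731904
```

`proj` dominates: roughly 4.2M of the ~4.7M parameters sit in the lift to the 4096-dim context space.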

	### `ref_token_scale` (training = 0.25)

	This is the runtime multiplier on the output. It is not a stored tensor
	but a knob in the inference node. Doubling it (→ 0.5) effectively doubles
	the K/V magnitude that `ref_attn` reads, which biases attention scores
	toward the reference tokens. Combined with `ref_context_scale`, you have
	two independent ways to over-/under-amplify the visual reference branch.

	---

	## 4. `ref_adaln_proj` — retrained, not continued

	Both builds have this projector, but the input dimension changed:

	| | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` |
	|---|---|---|
	| Pooling | `avg_1x1 ‖ max_1x1` (2-scale) | `avg_1x1 ‖ avg_2x2 ‖ max_1x1` (3-scale) |
	| `fc1.weight` shape | (512, 256) | (512, 768) |

	Because of the shape mismatch, the trainer reinitializes `ref_adaln_proj`
	from scratch when continuing from `ref_adaln_proj-role_embedding`. The
	`ref_adaln_proj` in the continuation is therefore not a fine-tune of the
	original; it was trained from scratch. wandb confirms this:
	`ref_proj/weight_norm` ramps from near-zero to ~2.9.
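
The two `fc1` input widths follow directly from the pooling layouts if each pooled cell contributes one 128-channel vector. The sketch below reproduces that arithmetic. The spec-string parsing and the helper names are assumptions for illustration, and the 128-channel figure is taken from the ref VAE latent shape quoted in the `ref_visual_proj` forward pass.

```python
LATENT_CHANNELS = 128  # assumed channel count of the ref VAE latent


def pooled_dim(spec, channels=LATENT_CHANNELS):
    """Input width of fc1 for a pooling spec such as ['avg_1x1', 'max_1x1']."""
    total = 0
    for entry in spec:
        _kind, grid = entry.split("_")  # 'avg_2x2' -> ('avg', '2x2')
        grid_h, grid_w = (int(n) for n in grid.split("x"))
        total += channels * grid_h * grid_w
    return total


def can_reuse(old_shape, new_shape):
    """A weight can only be carried over if the shapes match exactly."""
    return tuple(old_shape) == tuple(new_shape)


old_in = pooled_dim(["avg_1x1", "max_1x1"])
new_in = pooled_dim(["avg_1x1", "avg_2x2", "max_1x1"])
print(old_in, new_in)  # 256 768
print(can_reuse((512, old_in), (512, new_in)))  # False
```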

	### What it actually does

	Builds one per-sample vector that is added to the timestep bias fed
	into every transformer block's AdaLN layer. The result: a persistent,
	sample-wide "lean toward this reference" applied throughout denoising.

	### Why this is the complement of `ref_attn`

	- `ref_attn` is localized: visual tokens cross-attend per spatial cell,
	letting the model copy fine details.
	- `ref_adaln_proj` is global: a single conditioning vector tints all 48
	blocks uniformly. Best for "the overall look of the output should remind
	me of this reference" (palette, lighting, broad style).

	### `adaln_scale` (training = 2.0)

	The user-side multiplier. At training default 2.0, AdaLN is doing a lot of
	the appearance lifting. Common failure modes:

	- `adaln_scale=0`: model ignores the reference's global look; you keep
	only what `ref_attn` and the IC-LoRA tokens can recover. Expect washed-out
	identity.
	- `adaln_scale=1.0` (ComfyUI default before the recent realignment):
	exactly half the training-time strength. Identity is still recognizable
	but visibly weaker.
	- `adaln_scale>3`: identity dominates and the model starts ignoring the
	prompt / guide motion.
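
A toy illustration of why `adaln_scale=1.0` gives exactly half the training-time contribution: if the reference vector is scaled and then added to the timestep conditioning, the reference term is linear in the scale. Where exactly the multiplier is applied inside the real node is an assumption here, and `adaln_conditioning` is a hypothetical name.

```python
def adaln_conditioning(timestep_emb, ref_vec, adaln_scale=2.0):
    """Add the scaled reference vector to the timestep embedding, elementwise."""
    if len(timestep_emb) != len(ref_vec):
        raise ValueError("timestep_emb and ref_vec must have the same length")
    return [t + adaln_scale * r for t, r in zip(timestep_emb, ref_vec)]


t_emb = [0.5, -1.0, 0.25]
ref = [0.1, 0.2, -0.4]
for scale in (0.0, 1.0, 2.0):
    # scale=0.0 returns the timestep embedding unchanged
    print(scale, adaln_conditioning(t_emb, ref, scale))
```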

	---

	## 5. `role_embedding` — present in both, behavior depends on which you load

	A learned `[1, 128]` vector that adds a fingerprint to the patchified
	tokens belonging to the IC-LoRA reference image, so the transformer can tell
	the ref token apart from generic guide / target tokens.

	### In `ref_adaln_proj-role_embedding`
	Was trained with `use_visual_ref_role_embedding=True` — that's where the
	non-zero value (~0.125 norm) comes from. The `attn1`/`attn2` adapters in
	this build therefore learned to recognize this bias.

	### In `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`
	Inherits the value from `ref_adaln_proj-role_embedding` but trains with
	`use_visual_ref_role_embedding=False`, meaning the bias **is never added
	during training**. The vector is frozen at its inherited value; wandb shows
	its norm flat at 0.125 across the whole run.

	### Inference rule

	When loading `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`: keep
	`enable_role_embedding=False`. Turning it on adds a bias to the ref
	tokens that this build never saw during training: the `attn2` and `ff`
	adapters were retrained without it, so the bias acts as out-of-distribution
	noise and degrades the output.

	When loading `ref_adaln_proj-role_embedding` directly (no
	`..._ref_attn-ref_visual_proj` adapters), the opposite is true:
	`enable_role_embedding=True` matches the training distribution.
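
The rule can be derived from the build suffix alone. A small sketch follows; the function name is hypothetical and it encodes only the two builds described in this document.

```python
def recommended_enable_role_embedding(build_suffix):
    """True only for builds that trained with the role bias switched on.

    Per this document: the original build used the bias; the continuation
    (recognizable by 'ref_attn' in its suffix) trained without it.
    """
    extras = build_suffix.split("-")
    return "role_embedding" in extras and "ref_attn" not in extras


print(recommended_enable_role_embedding("ref_adaln_proj-role_embedding"))  # True
print(recommended_enable_role_embedding(
    "ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj"))  # False
```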

	---

	## 6. Quick reference: what each knob does at inference

	| Knob | `..._ref_attn-ref_visual_proj` training value | Effect of raising it | Effect of lowering it |
	|---|---|---|---|
	| `adaln_scale` | 2.0 | stronger global look | identity fades |
	| `ref_context_scale` | 0.01 | sharper fine-grained ID; can over-freeze | local detail blurs back to base |
	| `ref_token_scale` | 0.25 | more "voice" for the visual tokens in attention | `ref_attn` becomes a no-op |
	| `ref_start_block` / `ref_end_block` | 12 / 35 | (do not change) | (do not change); outside this range the LoRA is untrained |
	| `enable_role_embedding` | False | adds out-of-distribution bias to ref tokens | matches training |
	| `role_strength` | n/a | only matters if `enable_role_embedding=True` | |
	| Standard LoRA `strength_model` | 1.0 | over-fits to training distribution | drifts back toward base LTX-2 |

	The combination that mirrors training of the
	`..._ref_attn-ref_visual_proj` build exactly: `adaln_scale=2.0,
	ref_context_scale=0.01, ref_token_scale=0.25, ref_start_block=12,
	ref_end_block=35, enable_role_embedding=False, ref_init_from=attn2,
	strength_model=1.0`.
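
One way to keep a workflow honest is to hold the training-time values in a single object and report any deviation before running inference. The sketch below does that with a dataclass. The class and function names are hypothetical; only the field names and default values come from the list above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RefSettings:
    """Training-time values of the ..._ref_attn-ref_visual_proj build."""
    adaln_scale: float = 2.0
    ref_context_scale: float = 0.01
    ref_token_scale: float = 0.25
    ref_start_block: int = 12
    ref_end_block: int = 35
    enable_role_embedding: bool = False
    ref_init_from: str = "attn2"
    strength_model: float = 1.0


TRAINING = RefSettings()


def deviations(settings):
    """Map each knob that differs from training to (training_value, current_value)."""
    base = asdict(TRAINING)
    return {k: (base[k], v) for k, v in asdict(settings).items() if v != base[k]}


mine = RefSettings(adaln_scale=1.0, ref_start_block=0)
print(deviations(mine))  # {'adaln_scale': (2.0, 1.0), 'ref_start_block': (12, 0)}
```

An empty result means the settings mirror training exactly.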

	---

	## 7. Where the loaded files come from

	`scripts/split_editanything_lora.py` produces two safetensors per checkpoint.
	The filename suffix lists every extra that ended up in the module sidecar
	(fixed order: `ref_adaln_proj`, `role_embedding`, `ref_attn`,
	`ref_visual_proj`), so you can tell which mechanisms each pair carries
	without opening the file.

	Canonical pairs:

	```
	edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors
	edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors

	edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors
	edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors
	```

	Feed the `.standard.*` into ComfyUI's standard LoRA loader and the
	`.module.*` into `LTXVEditAnythingModuleLoader`. Mixing pairs across builds
	(e.g., `ref_adaln_proj-role_embedding.standard.*` with
	`..._ref_attn-ref_visual_proj.module.*`) is not supported — the LoRA deltas
	were trained against the partner adapters in the same build.
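
Because the suffix encodes the build, a mismatched pair can be caught from the file names before anything is loaded. A minimal sketch, assuming the naming pattern shown in the canonical pairs above; `parse_name` and `same_build` are hypothetical helpers, not part of `scripts/split_editanything_lora.py`.

```python
EXTRAS_ORDER = ("ref_adaln_proj", "role_embedding", "ref_attn", "ref_visual_proj")
PREFIX = "edit_anything_reference_v0.1_r128_"


def parse_name(filename):
    """Split '<prefix><extras>.<kind>.safetensors' into (extras tuple, kind)."""
    stem, kind, ext = filename.rsplit(".", 2)
    if ext != "safetensors" or kind not in ("standard", "module"):
        raise ValueError(f"unexpected file name: {filename}")
    if not stem.startswith(PREFIX):
        raise ValueError(f"unexpected prefix: {filename}")
    extras = tuple(stem[len(PREFIX):].split("-"))
    if any(extra not in EXTRAS_ORDER for extra in extras):
        raise ValueError(f"unknown extra in: {filename}")
    return extras, kind


def same_build(standard_file, module_file):
    """True if the two files form a (.standard, .module) pair of one build."""
    s_extras, s_kind = parse_name(standard_file)
    m_extras, m_kind = parse_name(module_file)
    return (s_kind, m_kind) == ("standard", "module") and s_extras == m_extras


orig_std = PREFIX + "ref_adaln_proj-role_embedding.standard.safetensors"
cont_mod = PREFIX + "ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors"
print(parse_name(cont_mod))
print(same_build(orig_std, cont_mod))  # False
```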