Alissonerdx committed · Commit 775562c · verified · 1 Parent(s): 8eb5c2a

Upload folder using huggingface_hub

edit_anything_30k_v1.1_motion_transfer_r128.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a5d01c404594cb12e69926a9ae066d01bd1115abd345e09254c391040b226471
+ size 1308816336
edit_anything_30k_v1.1_motion_transfer_r256.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:407e9eed49bd5df627d68ed5eb4cfddc0353e8d133e65ad23670b4439c5faef0
+ size 2617440424
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:63ffdeed38c191108229ec3085386ac10174a0730427f86ef2c20dec4c6ea663
+ size 450782608
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0e2e51d9eafd6636c9e752300578447344925b05bb5254a405302d3a6f9c668d
+ size 1308756368
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6f9d4483480f9766528553e9f5e61f6683d315da8c037ff23ac5e825908fed7c
+ size 38086368
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:11b69d939077ad48de24f3fbd02c7ecdfdf7db029c9dc694167e7063c61f650e
+ size 1308756368
lora_layers_impact.md ADDED
@@ -0,0 +1,284 @@
+ # Functional differences between the two builds and what each layer does
+
+ Companion to `lora_layers_reference.md`. That file is the inventory; this one
+ explains the **functional role** of every group of tensors and the **expected
+ behavioral impact** of toggling each branch at inference.
+
+ Two builds of the `edit_anything_reference_v0.1_r128` LoRA exist, each
+ delivered as a `(.standard, .module)` pair. The pairs are distinguished by
+ their **extras suffix**:
+
+ - `..._ref_adaln_proj-role_embedding.{standard,module}.safetensors`: the
+ original build. One mechanism for steering the model toward the reference
+ image: **global AdaLN appearance anchoring** (plus the IC-LoRA-style ref
+ tokens packed into the sequence).
+ - `..._ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.{standard,module}.safetensors`:
+ the continuation. Keeps everything from the original build and adds
+ **two new mechanisms** that operate on different time/space scales.
+
+ In the rest of this doc the two builds are referred to by their suffix only:
+ - `ref_adaln_proj-role_embedding`
+ - `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`
+
+ ---
+
+ ## TL;DR
+
+ | Branch | Where it acts | What it controls | New in `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`? |
+ |---|---|---|---|
+ | `attn1` LoRA | self-attention inside every block | scene cohesion, structural editing | no (carried over, frozen) |
+ | `attn2` LoRA | cross-attention to **text** (Gemma) | prompt following | no (retrained) |
+ | `ff` LoRA | feed-forward MLP | feature mixing / capacity | no (retrained) |
+ | **`ref_attn` LoRA** | dedicated cross-attention to **32 visual memory tokens** | preserving fine-grained appearance of the reference | **yes** |
+ | **`ref_visual_proj`** | projects the ref VAE latent into 32 context tokens | the *content* that `ref_attn` attends to | **yes** |
+ | `ref_adaln_proj` | produces a global vector added to the timestep AdaLN | overall color/style/identity bias | no (reinitialized and retrained with new pooling) |
+ | `role_embedding` | adds a 128-dim bias to ref tokens in the IC-LoRA sequence | tells the transformer "this token is the reference" | no (frozen in the continuation) |
+
+ So:
+ - `ref_adaln_proj-role_embedding` only had a **slow, global** appearance signal
+ (AdaLN) plus the IC-LoRA-style ref tokens packed into the sequence.
+ - `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` adds a **fast,
+ local** appearance signal (visual cross-attention) that injects the
+ reference's actual textures into every block in the 12 → 35 range.
+
+ ---
+
+ ## 1. The 10 modules shared between both builds
+
+ These cover the full 48-block transformer and were retrained in
+ `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` (except `attn1.*`,
+ which is loaded but frozen; see the training freeze policy in the inventory).
+
+ ### `attn1.{q,k,v,out.0}`: self-attention
+
+ Every transformer block first does self-attention over the latent video
+ tokens. The LoRA here adjusts how tokens relate to each other:
+ - **structural consistency** of the generated frames,
+ - **how strongly the `@reference` IC-LoRA token influences neighboring
+ spatial positions**,
+ - low-level look (sharpness, contrast).
+
+ In the `..._ref_attn-ref_visual_proj` build these are frozen on purpose so
+ the original priors over motion and structure stay intact. If the inference
+ output looks structurally broken (jitter, motion drift, layout collapse),
+ you probably misloaded these adapters or the standard LoRA is at the wrong
+ strength.
+
+ ### `attn2.{q,k,v,out.0}`: cross-attention to text
+
+ This is the prompt-following path. The Gemma text embedding is the K/V; the
+ video latent is the Q. The LoRA tunes how the prompt drives the edit.
+
+ - Stronger `attn2` deltas ⇒ the model **leans more on the prompt** ("Add
+ @reference sleeping on the armrest"). Useful for compositional control.
+ - If you disable or weaken the standard LoRA (e.g. `strength_model=0`), the
+ base model goes back to ignoring your edit instructions: even if `ref_attn`
+ is still active, the prompt-binding is gone.
+
+ ### `ff.net.{0.proj, 2}`: MLP capacity
+
+ The block's feed-forward part. The LoRA here adds **representational
+ capacity** to absorb the new behaviors that prompt + reference impose. There
+ is no single user-visible "knob" for this; it works behind the scenes.
+
+ If you slash its strength you'll see colors and textures drift back toward
+ generic LTX-2 outputs.
+
+ ---
+
+ ## 2. The new `ref_attn` branch
+
+ This is the heart of the change in
+ `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`. Each of the 48
+ transformer blocks now has an additional attention module, `ref_attn`,
+ alongside `attn1` (self) and `attn2` (text). `ref_attn` cross-attends from
+ the noisy video latent (Q) to **a small set of visual memory tokens
+ computed from the reference image** (K/V).
+
+ ### Why four projections (q/k/v/out.0)
+
+ `ref_attn` is a standard cross-attention module, so it carries the usual
+ query, key, value, and output projections. The base weights are copied from
+ `attn2` at load time (`init_ref_attn_from: attn2`) so the module starts as
+ "text cross-attn, but pointed at visual tokens"; the LoRA then teaches it to
+ actually *use* those visual tokens.
+
+ ### Per-block gating
+
+ `ref_attn` is only consulted in blocks **12 → 35** (this is what
+ `ref_start_block` / `ref_end_block` enforce at inference and what the trainer
+ used during fine-tuning). Skipping blocks 0–11 keeps the early low-level
+ features untouched; skipping blocks 36–47 lets the late decoding stages do
+ their job without extra visual bias.
+
+ ### Impact
+
+ - **Strong identity preservation** for things the AdaLN anchor can't capture
+ (small logos, eye color, fur texture, asymmetric details).
+ - Scaled by `ref_context_scale` (training default `0.01`). Small for a
+ reason: the visual tokens are dense, and the residual is added on top of
+ every block in the 12–35 range, so even at 0.01 the cumulative effect is
+ meaningful.
+ - Doubling the scale (→ 0.02) usually intensifies identity at the cost of
+ motion fidelity; going to 0.05+ tends to "freeze" parts of the scene to the
+ reference appearance.
+ - Setting `ref_start_block=0` is **destructive**: blocks 0–11 never saw
+ `ref_context` during training, so injecting it there feeds the model
+ noise, and outputs collapse to black or random patterns.
+
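The gating and scaling described above can be sketched in a few lines. This is a minimal illustration, not the trainer's or the ComfyUI node's code: `ref_attn_active` and `apply_ref_attn` are hypothetical names, plain lists of floats stand in for the real tensors, and the only values taken from this document are the block range 12 to 35 and the `ref_context_scale` default of 0.01.

```python
REF_START_BLOCK = 12      # from this document
REF_END_BLOCK = 35        # from this document (inclusive)
REF_CONTEXT_SCALE = 0.01  # training default from this document
NUM_BLOCKS = 48


def ref_attn_active(block_idx, start=REF_START_BLOCK, end=REF_END_BLOCK):
    """True when the block falls inside the inclusive ref_attn range."""
    return start <= block_idx <= end


def apply_ref_attn(hidden, ref_out, block_idx, scale=REF_CONTEXT_SCALE):
    """Add the scaled ref_attn output as a residual, only in gated blocks.

    `hidden` and `ref_out` are equal-length lists of floats standing in for
    the real tensors (hypothetical simplification).
    """
    if not ref_attn_active(block_idx):
        return list(hidden)
    return [h + scale * r for h, r in zip(hidden, ref_out)]


active_blocks = [i for i in range(NUM_BLOCKS) if ref_attn_active(i)]

# Toy run: a constant ref_attn output of 1.0 accumulates once per active block.
hidden = [0.0]
for i in range(NUM_BLOCKS):
    hidden = apply_ref_attn(hidden, [1.0], i)
```

In this toy setting the 24 active blocks each add 0.01, so the accumulated contribution is about 0.24. That is one way to read the remark that the cumulative effect is meaningful even at the small default scale.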
+ ---
+
+ ## 3. The new `ref_visual_proj`
+
+ This is the *source* of what `ref_attn` attends to. Without it the
+ `ref_attn` LoRA is useless: there are no visual tokens to read.
+
+ ### Forward
+
+ ```
+ ref_frame = mean over time of the ref VAE latent   # [B, 128, H, W]
+ local = adaptive_avg_pool to (4, 8)                # 32 spatial cells
+ global_mean, global_std over the whole frame       # 2 × 128
+ tokens = concat(local, broadcast(mean,std))        # [B, 32, 384]
+ tokens = proj(silu(fc1(tokens)))                   # [B, 32, 4096]
+ tokens = LayerNorm(tokens)
+ tokens = tokens + pos_embed[:, :32]
+ return tokens * token_scale                        # 0.25 in training
+ ```
+
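The first three steps of the forward pass (pooling to a 4×8 grid and appending global statistics) can be written out in plain Python to make the 32 × 384 shape concrete. This is a sketch under stated assumptions, not `SafeVisualRefProjector` itself: `adaptive_avg_pool2d` and `build_descriptor` are hypothetical helpers, the population standard deviation is a guess at the estimator, and the toy frame uses 2 channels instead of 128.

```python
import math


def adaptive_avg_pool2d(channel, out_h, out_w):
    """Average-pool one [H][W] channel to [out_h][out_w].

    Bin edges follow the usual adaptive-pooling rule:
    start = floor(i * H / out), end = ceil((i + 1) * H / out).
    """
    in_h, in_w = len(channel), len(channel[0])
    pooled = []
    for i in range(out_h):
        r0, r1 = (i * in_h) // out_h, math.ceil((i + 1) * in_h / out_h)
        row = []
        for j in range(out_w):
            c0, c1 = (j * in_w) // out_w, math.ceil((j + 1) * in_w / out_w)
            cells = [channel[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def build_descriptor(ref_frame, grid=(4, 8)):
    """Turn a [C][H][W] reference frame into grid_h * grid_w tokens of size 3 * C.

    Each token concatenates the local pooled values, the global mean, and the
    global std, one entry per channel. The population standard deviation is an
    assumption; the real projector may use a different estimator.
    """
    out_h, out_w = grid
    pooled = [adaptive_avg_pool2d(ch, out_h, out_w) for ch in ref_frame]
    means, stds = [], []
    for ch in ref_frame:
        flat = [v for row in ch for v in row]
        mean = sum(flat) / len(flat)
        var = sum((v - mean) ** 2 for v in flat) / len(flat)
        means.append(mean)
        stds.append(math.sqrt(var))
    tokens = []
    for i in range(out_h):
        for j in range(out_w):
            local = [pooled[c][i][j] for c in range(len(ref_frame))]
            tokens.append(local + means + stds)
    return tokens


# Toy frame with 2 channels instead of 128, spatial size 8 x 16.
frame = [[[float(c + 1)] * 16 for _ in range(8)] for c in range(2)]
tokens = build_descriptor(frame)
```

With 128 latent channels the same construction yields 32 tokens of width 3 × 128 = 384, which matches the input dimension of `fc1`.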
+ ### Layer-by-layer impact
+
+ | Tensor | What it controls | If perturbed |
+ |---|---|---|
+ | `fc1.weight / bias` (1024×384) | maps the 384-dim raw appearance descriptor into the projector's hidden space | weights here decide *which* aspects of the pooled appearance survive (e.g. color vs. texture vs. luminance) |
+ | `proj.weight / bias` (4096×1024) | lifts the hidden vector into the transformer context dim | initialized with small gain (0.05) so the branch starts almost-no-op; loaded from training |
+ | `norm.weight / bias` (4096) | LayerNorm on the projected tokens | keeps numerical range consistent across reference images so `ref_attn` works at the same scale regardless of input statistics |
+ | `pos_embed` (1, 32, 4096) | per-position bias for the 32 memory tokens | the model uses this to distinguish "top-left cell" from "bottom-right cell"; without it the 32 tokens would be an unordered set (attention is permutation-invariant over its keys) and `ref_attn` would degenerate |
+
+ ### `ref_token_scale` (training = 0.25)
+
+ This is the runtime multiplier on the projector output. It is **not** a
+ stored tensor but a knob in the inference node. Doubling it (→ 0.5) roughly
+ doubles the K/V magnitude that `ref_attn` reads: larger keys sharpen the
+ attention distribution over the 32 reference tokens, and larger values
+ increase what each block reads from them. Combined with `ref_context_scale`,
+ you have two independent ways to over-/under-amplify the visual reference
+ branch.
+
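To illustrate how a larger key magnitude changes attention, the sketch below computes scaled dot-product attention weights for one query over three toy keys, once at the original scale and once with the keys doubled. It is a simplified assumption-laden example: `attention_weights` is a hypothetical helper, the scale is applied directly to the keys (in the real model it multiplies the tokens before the `to_k` / `to_v` projections, whose biases are ignored here), and the vectors are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, key_scale=1.0):
    """Scaled dot-product attention weights of one query over several keys."""
    dim = len(query)
    logits = [
        sum(q * (key_scale * k) for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    return softmax(logits)


query = [1.0, 0.0]
keys = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # three toy "memory tokens"

base = attention_weights(query, keys, key_scale=1.0)
doubled = attention_weights(query, keys, key_scale=2.0)
```

Doubling the keys doubles every logit, so the best-matching token takes a larger share of the attention mass and the weakest one a smaller share.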
+ ---
+
+ ## 4. `ref_adaln_proj`: *retrained, not continued*
+
+ Both builds have this projector, but **the input dimension changed**:
+
+ | | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` |
+ |---|---|---|
+ | Pooling | `avg_1x1 ‖ max_1x1` (2-scale) | `avg_1x1 ‖ avg_2x2 ‖ max_1x1` (3-scale) |
+ | `fc1.weight` shape | (512, **256**) | (512, **768**) |
+
+ Because of the shape mismatch the trainer **reinitializes** `ref_adaln_proj`
+ from scratch when continuing from `ref_adaln_proj-role_embedding`. The
+ `ref_adaln_proj` in the continuation is not a fine-tune of the original; it
+ was trained from scratch. wandb confirms this: `ref_proj/weight_norm` ramps
+ from near-zero to ~2.9.
+
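The two `fc1` input widths follow from the pooling specs by simple arithmetic. The sketch below assumes that each `kind_HxW` term contributes `channels * H * W` features (so `avg_2x2` is a 2×2 grid flattened to four cells) and that the latent has 128 channels, as in the shapes quoted in section 3. `pooled_dim` and the `|` separator are illustrative, not part of the trainer.

```python
LATENT_CHANNELS = 128  # channel count of the ref VAE latent, per this document


def pooled_dim(spec, channels=LATENT_CHANNELS):
    """Input width of fc1 for a pooling spec such as 'avg_1x1|max_1x1'.

    Each term 'kind_HxW' contributes channels * H * W features.
    The '|' separator and this parser are illustrative assumptions.
    """
    total = 0
    for term in spec.split("|"):
        _kind, grid = term.split("_")
        height, width = (int(n) for n in grid.split("x"))
        total += channels * height * width
    return total


original_dim = pooled_dim("avg_1x1|max_1x1")
continuation_dim = pooled_dim("avg_1x1|avg_2x2|max_1x1")
```

Under these assumptions the original spec gives 128 + 128 = 256 and the continuation gives 128 + 512 + 128 = 768, matching the two `fc1.weight` shapes in the table.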
+ ### What it actually does
+
+ Builds one **per-sample** vector that is **added to the timestep bias** fed
+ into every transformer block's AdaLN layer. The result: a persistent,
+ sample-wide "lean toward this reference" applied throughout denoising.
+
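As a rough picture of this mechanism, the sketch below adds a scaled reference vector to a stand-in for the timestep AdaLN parameters. It assumes the multiplier acts linearly on the reference vector, which is consistent with the statement in this document that `adaln_scale=1.0` is half the training-time strength but is not confirmed by the source code. `condition_with_reference` is a hypothetical name and the numbers are made up.

```python
def condition_with_reference(timestep_params, ref_vector, adaln_scale=2.0):
    """Add the scaled reference vector to the timestep AdaLN parameters.

    Both inputs are equal-length lists of floats standing in for the real
    per-sample tensors (hypothetical simplification).
    """
    if len(timestep_params) != len(ref_vector):
        raise ValueError("timestep_params and ref_vector must have equal length")
    return [t + adaln_scale * r for t, r in zip(timestep_params, ref_vector)]


timestep_params = [0.5, -1.0, 0.25]
ref_vector = [0.1, 0.2, -0.4]

at_training_scale = condition_with_reference(timestep_params, ref_vector, 2.0)
at_half_scale = condition_with_reference(timestep_params, ref_vector, 1.0)
disabled = condition_with_reference(timestep_params, ref_vector, 0.0)
```

Because the same vector is added for every block and every denoising step, the effect is a constant bias rather than something that varies by spatial position.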
+ ### Why this is the *complement* of `ref_attn`
+
+ - `ref_attn` is **localized**: the video latent cross-attends to one visual
+ token per spatial cell, letting the model copy fine details.
+ - `ref_adaln_proj` is **global**: a single conditioning vector tints all 48
+ blocks uniformly. Best for "the overall look of the output should remind
+ me of this reference" (palette, lighting, broad style).
+
+ ### `adaln_scale` (training = 2.0)
+
+ The user-side multiplier. At training default 2.0, AdaLN is doing a lot of
+ the appearance lifting. Common failure modes:
+
+ - **`adaln_scale=0`**: model ignores the reference's global look; you keep
+ only what `ref_attn` and the IC-LoRA tokens can recover. Expect washed-out
+ identity.
+ - **`adaln_scale=1.0`** (ComfyUI default before the recent realignment):
+ exactly half the training-time strength. Identity is still recognizable
+ but visibly weaker.
+ - **`adaln_scale>3`**: identity dominates and the model starts ignoring the
+ prompt / guide motion.
+
+ ---
+
+ ## 5. `role_embedding`: present in both, behavior depends on which you load
+
+ A learned `[1, 128]` vector that **adds a fingerprint** to the patchified
+ tokens belonging to the IC-LoRA reference image, so the transformer can tell
+ the ref token apart from generic guide / target tokens.
+
+ ### In `ref_adaln_proj-role_embedding`
+ Was trained with `use_visual_ref_role_embedding=True`; that's where the
+ non-zero value (~0.125 norm) comes from. The `attn1`/`attn2` adapters in
+ this build therefore learned to *recognize* this bias.
+
+ ### In `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`
+ Inherits the value from `ref_adaln_proj-role_embedding` but trains with
+ `use_visual_ref_role_embedding=False`, meaning the bias **is never added
+ during training**. The vector is frozen at its inherited value; wandb shows
+ its norm flat at ~0.125 across the whole run.
+
+ ### Inference rule
+
+ When loading `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`: keep
+ **`enable_role_embedding=False`**. Turning it on adds a bias to the ref
+ tokens that this build never saw during training: the trainable adapters
+ (`attn2`, `ff`, `ref_attn`) were trained without it, so the bias acts as
+ out-of-distribution noise and degrades the output.
+
+ When loading `ref_adaln_proj-role_embedding` directly (no
+ `..._ref_attn-ref_visual_proj` adapters), the opposite is true:
+ `enable_role_embedding=True` matches the training distribution.
+
+ ---
+
+ ## 6. Quick reference: what each knob does at inference
+
+ | Knob | `..._ref_attn-ref_visual_proj` training value | Effect of raising it | Effect of lowering it |
+ |---|---|---|---|
+ | `adaln_scale` | 2.0 | stronger global look | identity fades |
+ | `ref_context_scale` | 0.01 | sharper fine-grained ID; can over-freeze | local detail blurs back to base |
+ | `ref_token_scale` | 0.25 | more "voice" for the visual tokens in attention | `ref_attn` approaches a no-op |
+ | `ref_start_block` / `ref_end_block` | 12 / 35 | (do not change) | (do not change); outside this range the LoRA is untrained |
+ | `enable_role_embedding` | False | (True) adds out-of-distribution bias to ref tokens | (False) matches training |
+ | `role_strength` | n/a | only matters if `enable_role_embedding=True` | only matters if `enable_role_embedding=True` |
+ | Standard LoRA `strength_model` | 1.0 | over-applies the learned deltas | drifts back toward base LTX-2 |
+
+ The combination that mirrors training of the
+ `..._ref_attn-ref_visual_proj` build exactly: `adaln_scale=2.0,
+ ref_context_scale=0.01, ref_token_scale=0.25, ref_start_block=12,
+ ref_end_block=35, enable_role_embedding=False, ref_init_from=attn2,
+ strength_model=1.0`.
+
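The training-matched combination can be kept as a small dictionary and used to flag deviations before a run. This is a hypothetical helper, not part of the ComfyUI nodes: the knob names and values are the ones listed above, while `check_settings` and the split into hard errors and soft warnings are illustrative choices.

```python
TRAINING_SETTINGS = {
    "adaln_scale": 2.0,
    "ref_context_scale": 0.01,
    "ref_token_scale": 0.25,
    "ref_start_block": 12,
    "ref_end_block": 35,
    "enable_role_embedding": False,
    "ref_init_from": "attn2",
    "strength_model": 1.0,
}

# Knobs this document says should not be moved for this build.
FIXED_KNOBS = ("ref_start_block", "ref_end_block", "enable_role_embedding")


def check_settings(settings):
    """Return (errors, warnings) for a dict of inference settings.

    Missing keys are treated as being at their training value.
    """
    errors, warnings = [], []
    for name, expected in TRAINING_SETTINGS.items():
        actual = settings.get(name, expected)
        if actual == expected:
            continue
        message = f"{name}={actual!r} differs from training value {expected!r}"
        (errors if name in FIXED_KNOBS else warnings).append(message)
    return errors, warnings


errors, warnings = check_settings(
    {"adaln_scale": 1.0, "ref_start_block": 0, "enable_role_embedding": False}
)
```

In this example `ref_start_block=0` is reported as an error and `adaln_scale=1.0` only as a warning, mirroring the distinction between knobs that must stay fixed and knobs that can be tuned.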
+ ---
+
+ ## 7. Where the loaded files come from
+
+ `scripts/split_editanything_lora.py` produces two safetensors files per
+ checkpoint. The filename suffix lists every extra that ended up in the module
+ sidecar (fixed order: `ref_adaln_proj`, `role_embedding`, `ref_attn`,
+ `ref_visual_proj`), so you can tell which mechanisms each pair carries
+ without opening the file.
+
+ Canonical pairs:
+
+ ```
+ edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors
+ edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors
+
+ edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors
+ edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors
+ ```
+
+ Feed the `.standard.*` into ComfyUI's standard LoRA loader and the
+ `.module.*` into `LTXVEditAnythingModuleLoader`. Mixing pairs across builds
+ (e.g., `ref_adaln_proj-role_embedding.standard.*` with
+ `..._ref_attn-ref_visual_proj.module.*`) is not supported: the LoRA deltas
+ were trained against the partner adapters in the same build.
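Since the suffix encodes the build, a mismatched pair can be caught from the file names alone. The sketch below is a hypothetical check, not something the split script or the loader is known to do: `parse_split_name` and `is_matching_pair` are illustrative names, and the parsing relies only on the naming scheme described above.

```python
EXTRAS_ORDER = ("ref_adaln_proj", "role_embedding", "ref_attn", "ref_visual_proj")
BASENAME = "edit_anything_reference_v0.1_r128"


def parse_split_name(filename, basename=BASENAME):
    """Split '<basename>_<extras>.<kind>.safetensors' into (extras, kind)."""
    stem, kind, ext = filename.rsplit(".", 2)
    if ext != "safetensors" or kind not in ("standard", "module"):
        raise ValueError(f"unexpected file name: {filename}")
    prefix = basename + "_"
    if not stem.startswith(prefix):
        raise ValueError(f"unexpected basename in: {filename}")
    extras = tuple(stem[len(prefix):].split("-"))
    # Extras must be known names and appear in the fixed order.
    expected = tuple(e for e in EXTRAS_ORDER if e in extras)
    if extras != expected:
        raise ValueError(f"extras not in the fixed order: {extras}")
    return extras, kind


def is_matching_pair(standard_file, module_file):
    """True when both files come from the same build (same extras suffix)."""
    std_extras, std_kind = parse_split_name(standard_file)
    mod_extras, mod_kind = parse_split_name(module_file)
    return (std_kind, mod_kind) == ("standard", "module") and std_extras == mod_extras


orig_std = BASENAME + "_ref_adaln_proj-role_embedding.standard.safetensors"
orig_mod = BASENAME + "_ref_adaln_proj-role_embedding.module.safetensors"
cont_mod = (
    BASENAME
    + "_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors"
)

same_build = is_matching_pair(orig_std, orig_mod)
mixed_build = is_matching_pair(orig_std, cont_mod)
```

Here `same_build` is true and `mixed_build` is false, which corresponds to the unsupported cross-build combination described above.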
lora_layers_reference.md ADDED
@@ -0,0 +1,196 @@
+ # LoRA Layer Inventory: Edit Anything checkpoints
+
+ Inventory of every tensor in two builds of the
+ `edit_anything_reference_v0.1_r128` LoRA.
+
+ Both builds share the same canonical basename
+ (`edit_anything_reference_v0.1_r128`) and are distinguished by the **extras
+ suffix** that `scripts/split_editanything_lora.py` appends to the output
+ filenames:
+
+ - `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.{standard,module}.safetensors`:
+ the original build. Only ships `ref_adaln_proj` + `role_embedding`.
+ - `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.{standard,module}.safetensors`:
+ the continuation, fine-tuned with the
+ `video_to_video_ref_visual_adaln` strategy. Adds the `ref_attn` LoRA
+ branch and the `ref_visual_proj` projector on top of the original
+ extras.
+
+ In the rest of this doc the two are referred to by their suffix only:
+ - `ref_adaln_proj-role_embedding`
+ - `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`
+
+ Rank is 128 in both (encoded in the LoRA tensor shapes; no `alpha` keys saved).
+ Dtype is `bfloat16` throughout. All LoRA modules cover **48 transformer blocks**.
+
+ ---
+
+ ## 1. Summary
+
+ | | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` |
+ |---|---|---|
+ | Total tensors | 965 | 1356 |
+ | LoRA-target modules | **10** | **14** |
+ | LoRA tensors (A+B) | 960 | 1344 |
+ | Extra (non-LoRA) tensors | 5 | 12 |
+ | `ref_attn` LoRA branch | ❌ absent | ✅ adapters present on all 48 blocks |
+ | `ref_visual_proj` (visual cross-attn projector) | ❌ absent | ✅ present (7 tensors) |
+ | `ref_adaln_proj` (global appearance AdaLN) | ✅ (fc1 input dim **256**) | ✅ (fc1 input dim **768**) |
+ | `role_embedding` | ✅ shape (1, 128) | ✅ shape (1, 128) |
+
+ ---
+
+ ## 2. LoRA adapters
+
+ Each row = one target module type. Each entry = (`lora_A.weight`, `lora_B.weight`)
+ repeated for each of the 48 blocks of `diffusion_model.transformer_blocks.*`.
+
+ | Module | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` | Notes |
+ |---|:---:|:---:|---|
+ | `attn1.to_q` | ✅ | ✅ | self-attention query |
+ | `attn1.to_k` | ✅ | ✅ | self-attention key |
+ | `attn1.to_v` | ✅ | ✅ | self-attention value |
+ | `attn1.to_out.0` | ✅ | ✅ | self-attention output proj |
+ | `attn2.to_q` | ✅ | ✅ | cross-attention to text (Gemma) |
+ | `attn2.to_k` | ✅ | ✅ | |
+ | `attn2.to_v` | ✅ | ✅ | |
+ | `attn2.to_out.0` | ✅ | ✅ | |
+ | `ff.net.0.proj` | ✅ | ✅ | feed-forward up-projection |
+ | `ff.net.2` | ✅ | ✅ | feed-forward down-projection |
+ | `ref_attn.to_q` | ❌ | ✅ | **new**: visual reference cross-attention |
+ | `ref_attn.to_k` | ❌ | ✅ | **new** |
+ | `ref_attn.to_v` | ❌ | ✅ | **new** |
+ | `ref_attn.to_out.0` | ❌ | ✅ | **new** |
+
+ **Key naming**: `diffusion_model.transformer_blocks.{0..47}.{module}.{lora_A|lora_B}.weight`
+
+ **Training freeze policy** for the
+ `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build
+ (per `stage2_ref_visual_adaln_crossattn_from_v01_r128.yaml`):
+ - `attn1.*` adapters loaded from the `ref_adaln_proj-role_embedding` build
+ but **frozen** (`trainable_include_patterns` excludes them).
+ - `attn2.*`, `ff.*`, `ref_attn.*` are trainable.
+
+ ---
+
+ ## 3. Non-LoRA modules (the module sidecar)
+
+ These tensors live at the top of the state dict (no `transformer_blocks.*` prefix)
+ and are consumed by the custom inference path (`LTXVEditAnythingModuleLoader` +
+ `LTXVEditAnythingLoopingSampler`), not by the standard ComfyUI LoRA loader.
+
+ ### 3.1. `role_embedding`: appearance role bias
+
+ | Key | Shape | Notes |
+ |---|---|---|
+ | `role_embedding.embedding.weight` | (1, 128) | 1 slot (appearance). Padded to (3, 128) at inference; entry stored at slot 1 (ref_img role). |
+
+ Present in **both** builds with the same shape. In the
+ `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build it is
+ **frozen** (`use_visual_ref_role_embedding: false`); wandb shows its norm
+ stays flat at ~0.125 throughout training.
+
+ ### 3.2. `ref_adaln_proj`: global AdaLN appearance anchor
+
+ Two-layer MLP that maps the pooled reference latent to a vector added to
+ every block's AdaLN timestep bias.
+
+ | Key | `ref_adaln_proj-role_embedding` shape | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` shape |
+ |---|---|---|
+ | `ref_adaln_proj.fc1.weight` | (512, **256**) | (512, **768**) |
+ | `ref_adaln_proj.fc1.bias` | (512,) | (512,) |
+ | `ref_adaln_proj.proj.weight` | (36864, 512) | (36864, 512) |
+ | `ref_adaln_proj.proj.bias` | (36864,) | (36864,) |
+
+ > ⚠️ **Shape mismatch on `fc1.weight`**.
+ > The `ref_adaln_proj-role_embedding` build was trained with a 2-scale pool
+ > (`avg_1x1 ‖ max_1x1` → 256-dim input).
+ > The `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build was
+ > trained with a 3-scale pool (`avg_1x1 ‖ avg_2x2 ‖ max_1x1` → 768-dim).
+ > Because of this incompatibility the trainer **reinitializes**
+ > `ref_adaln_proj` from scratch when continuing from
+ > `ref_adaln_proj-role_embedding`; the AdaLN projector in the continuation
+ > is **not** a fine-tune of the original one. The output dim 36864 = AdaLN
+ > param count for the LTX-2 transformer (read at runtime via
+ > `preprocessor.adaln.linear.out_features`).
+
+ ### 3.3. `ref_visual_proj`: visual cross-attention memory tokens
+
+ Present in `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` only.
+ Implemented by `SafeVisualRefProjector` (training file
+ `video_to_video_ref_visual.py`). Produces 32 visual memory tokens consumed
+ by the new `ref_attn` branch.
+
+ | Key | Shape | Notes |
+ |---|---|---|
+ | `ref_visual_proj.fc1.weight` | (1024, **384**) | input 384 = 128 (local pooled) + 128 (global mean) + 128 (global std) |
+ | `ref_visual_proj.fc1.bias` | (1024,) | xavier init gain 0.1 |
+ | `ref_visual_proj.proj.weight` | (4096, 1024) | maps to context_dim 4096; xavier init gain 0.05 |
+ | `ref_visual_proj.proj.bias` | (4096,) | |
+ | `ref_visual_proj.norm.weight` | (4096,) | LayerNorm γ |
+ | `ref_visual_proj.norm.bias` | (4096,) | LayerNorm β |
+ | `ref_visual_proj.pos_embed` | (1, 32, 4096) | per-token learned positional bias |
+
+ Forward (matches `SafeVisualRefProjector.forward`):
+ ```
+ tokens = local ‖ global_mean ‖ global_std   # [B, 32, 384]
+ tokens = proj(silu(fc1(tokens)))            # → [B, 32, 4096]
+ tokens = LayerNorm(tokens)
+ tokens = tokens + pos_embed[:, :32]
+ return tokens * token_scale                 # training default 0.25
+ ```
+
+ Not present in `ref_adaln_proj-role_embedding`; this entire branch is new.
+
+ ---
+
+ ## 4. Total tensor counts (sanity check)
+
+ ### `ref_adaln_proj-role_embedding`
+ ```
+ LoRA:            10 modules × 48 blocks × 2 (A,B) =  960
+ ref_adaln_proj:  4 (fc1.{w,b}, proj.{w,b})        =    4
+ role_embedding:  1                                =    1
+                                             total =  965 ✓
+ ```
+
+ ### `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`
+ ```
+ LoRA:            14 modules × 48 blocks × 2 (A,B) = 1344
+ ref_adaln_proj:  4                                =    4
+ ref_visual_proj: 7                                =    7
+ role_embedding:  1                                =    1
+                                             total = 1356 ✓
+ ```
+
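The same sanity check can be reproduced by expanding the key-naming pattern from section 2 and counting. This is an illustrative sketch: `lora_keys` is a hypothetical helper, the module lists are copied from the table in section 2, and the extra-tensor counts are the ones listed in section 3.

```python
SHARED_MODULES = [
    "attn1.to_q", "attn1.to_k", "attn1.to_v", "attn1.to_out.0",
    "attn2.to_q", "attn2.to_k", "attn2.to_v", "attn2.to_out.0",
    "ff.net.0.proj", "ff.net.2",
]
REF_ATTN_MODULES = [
    "ref_attn.to_q", "ref_attn.to_k", "ref_attn.to_v", "ref_attn.to_out.0",
]
NUM_BLOCKS = 48


def lora_keys(modules, num_blocks=NUM_BLOCKS):
    """Expand module names into per-block lora_A / lora_B weight keys."""
    return [
        f"diffusion_model.transformer_blocks.{block}.{module}.{part}.weight"
        for block in range(num_blocks)
        for module in modules
        for part in ("lora_A", "lora_B")
    ]


# Extra (non-LoRA) tensor counts, as listed in section 3.
REF_ADALN_PROJ, REF_VISUAL_PROJ, ROLE_EMBEDDING = 4, 7, 1

original_total = len(lora_keys(SHARED_MODULES)) + REF_ADALN_PROJ + ROLE_EMBEDDING
continuation_total = (
    len(lora_keys(SHARED_MODULES + REF_ATTN_MODULES))
    + REF_ADALN_PROJ + REF_VISUAL_PROJ + ROLE_EMBEDDING
)
```

The generated key list could also be compared against the keys read from an actual checkpoint to spot missing or unexpected tensors.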
+ ---
+
+ ## 5. Loading a checkpoint at inference
+
+ Use `scripts/split_editanything_lora.py` to split each raw training
+ checkpoint into:
+ - `*.standard.safetensors`: LoRA on `attn1/attn2/ff` only; safe to feed to
+ ComfyUI's standard LoraLoader.
+ - `*.module.safetensors`: everything else (`role_embedding`,
+ `ref_adaln_proj`, `ref_visual_proj`, `ref_attn` LoRA adapters); feed to
+ `LTXVEditAnythingModuleLoader`.
+
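The split rule above amounts to a partition of the state-dict keys. The sketch below shows one way such a partition could be expressed. It is not the logic of `split_editanything_lora.py`, which is not reproduced in this document: `is_standard_key` and `split_keys` are hypothetical names, and the rule is inferred only from the two bullet points above and the key naming in section 2.

```python
STANDARD_PREFIXES = ("attn1.", "attn2.", "ff.")


def is_standard_key(key):
    """True for LoRA keys on attn1/attn2/ff inside a transformer block."""
    marker = "transformer_blocks."
    if marker not in key:
        return False  # top-level extras such as ref_adaln_proj.* or role_embedding.*
    # Drop everything up to and including the block index.
    after_block = key.split(marker, 1)[1].split(".", 1)[1]
    return after_block.startswith(STANDARD_PREFIXES)


def split_keys(keys):
    """Partition state-dict keys into (standard, module) lists."""
    standard = [k for k in keys if is_standard_key(k)]
    module = [k for k in keys if not is_standard_key(k)]
    return standard, module


example_keys = [
    "diffusion_model.transformer_blocks.0.attn1.to_q.lora_A.weight",
    "diffusion_model.transformer_blocks.12.ref_attn.to_k.lora_B.weight",
    "diffusion_model.transformer_blocks.47.ff.net.2.lora_A.weight",
    "ref_adaln_proj.fc1.weight",
    "role_embedding.embedding.weight",
]
standard, module = split_keys(example_keys)
```

In this example the `attn1` and `ff` keys land in the standard list, while the `ref_attn` LoRA key and the two top-level extras land in the module list.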
+ The filename suffix lists every extra that ended up in the module sidecar,
+ so it is obvious at a glance which mechanisms a given pair carries. Order is
+ fixed: `ref_adaln_proj`, `role_embedding`, `ref_attn`, `ref_visual_proj`.
+
+ ### Canonical output names
+
+ ```
+ edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors
+ edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors
+
+ edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors
+ edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors
+ ```
+
+ ### Command
+
+ ```bash
+ python3 /data/training/ltx-edit-trainer/scripts/split_editanything_lora.py \
+ <raw-checkpoint>.safetensors --output-dir <dir> [--overwrite]
+ ```