# LoRA Layer Inventory: Edit Anything checkpoints

Inventory of every tensor in two builds of the
`edit_anything_reference_v0.1_r128` LoRA.

Both builds share the same canonical basename
(`edit_anything_reference_v0.1_r128`) and are distinguished by the **extras
suffix** that `scripts/split_editanything_lora.py` appends to the output
filenames:

- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.{standard,module}.safetensors`:
  the original build. It ships only the `ref_adaln_proj` and `role_embedding` extras.
- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.{standard,module}.safetensors`:
  the continuation, fine-tuned with the
  `video_to_video_ref_visual_adaln` strategy. It adds the `ref_attn` LoRA
  branch and the `ref_visual_proj` projector on top of the original
  extras.

In the rest of this doc the two are referred to by their suffix only:
- `ref_adaln_proj-role_embedding`
- `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`

Rank is 128 in both (encoded in the LoRA tensor shapes; no `alpha` keys saved).
Dtype is `bfloat16` throughout. All LoRA modules cover **48 transformer blocks**.
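
Since the rank is recoverable only from tensor shapes, one way to confirm it is to read the safetensors header directly: the file starts with an unsigned little-endian 64-bit length, followed by a JSON table of dtypes and shapes. The sketch below uses only the standard library. `read_header` and `lora_ranks` are hypothetical helper names, and the code assumes the usual LoRA layout in which `lora_A.weight` is stored as `(rank, in_features)`.

```python
import json
import struct


def read_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def lora_ranks(header):
    """Collect the distinct ranks and dtypes seen on lora_A tensors.

    Assumption: lora_A.weight is stored as (rank, in_features).
    """
    ranks, dtypes = set(), set()
    for key, info in header.items():
        if key.endswith(".lora_A.weight"):
            ranks.add(info["shape"][0])
            dtypes.add(info["dtype"])
    return ranks, dtypes
```

If the description above holds, `lora_ranks(read_header(path))` on either build should report a single rank of 128 and a single dtype (safetensors spells bfloat16 as `BF16`).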

---

## 1. Summary

| | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` |
|---|---|---|
| Total tensors | 965 | 1356 |
| LoRA-target modules | **10** | **14** |
| LoRA tensors (A+B) | 960 | 1344 |
| Extra (non-LoRA) tensors | 5 | 12 |
| `ref_attn` LoRA branch | ❌ absent | ✅ trained on 48 blocks |
| `ref_visual_proj` (visual cross-attn projector) | ❌ absent | ✅ present (7 tensors) |
| `ref_adaln_proj` (global appearance AdaLN) | ✅ (fc1 input dim **256**) | ✅ (fc1 input dim **768**) |
| `role_embedding` | ✅ shape (1, 128) | ✅ shape (1, 128) |

---

## 2. LoRA adapters

Each row is one target module type. Each entry is a (`lora_A.weight`, `lora_B.weight`)
pair present in every one of the 48 blocks of `diffusion_model.transformer_blocks.*`.

| Module | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` | Notes |
|---|:---:|:---:|---|
| `attn1.to_q` | ✅ | ✅ | self-attention query |
| `attn1.to_k` | ✅ | ✅ | self-attention key |
| `attn1.to_v` | ✅ | ✅ | self-attention value |
| `attn1.to_out.0` | ✅ | ✅ | self-attention output proj |
| `attn2.to_q` | ✅ | ✅ | cross-attention to text (Gemma) |
| `attn2.to_k` | ✅ | ✅ | |
| `attn2.to_v` | ✅ | ✅ | |
| `attn2.to_out.0` | ✅ | ✅ | |
| `ff.net.0.proj` | ✅ | ✅ | feed-forward up-projection |
| `ff.net.2` | ✅ | ✅ | feed-forward down-projection |
| `ref_attn.to_q` | ❌ | ✅ | **new**: visual reference cross-attention |
| `ref_attn.to_k` | ❌ | ✅ | **new** |
| `ref_attn.to_v` | ❌ | ✅ | **new** |
| `ref_attn.to_out.0` | ❌ | ✅ | **new** |

**Key naming**: `diffusion_model.transformer_blocks.{0..47}.{module}.{lora_A|lora_B}.weight`
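
The key pattern above can be parsed mechanically, which helps when grouping tensors by block or by module. A minimal sketch, where `KEY_RE` and `parse_lora_key` are hypothetical names introduced for illustration:

```python
import re

# Mirrors the documented pattern:
# diffusion_model.transformer_blocks.{block}.{module}.{lora_A|lora_B}.weight
KEY_RE = re.compile(
    r"^diffusion_model\.transformer_blocks\.(?P<block>\d+)\."
    r"(?P<module>.+)\.(?P<ab>lora_A|lora_B)\.weight$"
)


def parse_lora_key(key):
    """Split a LoRA key into (block index, module name, 'lora_A' or 'lora_B').

    Returns None for keys that do not follow the per-block LoRA pattern,
    such as the top-level extras.
    """
    m = KEY_RE.match(key)
    if m is None:
        return None
    return int(m.group("block")), m.group("module"), m.group("ab")
```

Module names that themselves contain dots, such as `attn1.to_out.0` or `ff.net.0.proj`, are kept whole because the pattern anchors on the trailing `.lora_A.weight` or `.lora_B.weight`.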

**Training freeze policy** for the
`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build
(per `stage2_ref_visual_adaln_crossattn_from_v01_r128.yaml`):
- `attn1.*` adapters loaded from the `ref_adaln_proj-role_embedding` build
  but **frozen** (`trainable_include_patterns` excludes them).
- `attn2.*`, `ff.*`, `ref_attn.*` are trainable.
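
To see what this policy means in tensor counts, the sketch below partitions the 14 module types by prefix. It is not the trainer's matching logic: the real `trainable_include_patterns` syntax is defined by the YAML config, and `TRAINABLE_PREFIXES` is only a stand-in for it. It also covers the LoRA adapters only, not the extras.

```python
LORA_MODULES = [
    "attn1.to_q", "attn1.to_k", "attn1.to_v", "attn1.to_out.0",
    "attn2.to_q", "attn2.to_k", "attn2.to_v", "attn2.to_out.0",
    "ff.net.0.proj", "ff.net.2",
    "ref_attn.to_q", "ref_attn.to_k", "ref_attn.to_v", "ref_attn.to_out.0",
]

# Assumption: plain prefix matching stands in for the YAML include patterns.
TRAINABLE_PREFIXES = ("attn2.", "ff.", "ref_attn.")


def split_by_policy(modules, blocks=48):
    """Count trainable vs. frozen LoRA tensors (A and B per module per block)."""
    trainable = [m for m in modules if m.startswith(TRAINABLE_PREFIXES)]
    frozen = [m for m in modules if not m.startswith(TRAINABLE_PREFIXES)]
    per_module = blocks * 2
    return {
        "trainable": len(trainable) * per_module,
        "frozen": len(frozen) * per_module,
    }
```

Under these assumptions, 10 of the 14 module types are trainable and the 4 `attn1.*` types are frozen.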

---

## 3. Non-LoRA modules (the module sidecar)

These tensors live at the top level of the state dict (no `transformer_blocks.*` prefix)
and are consumed by the custom inference path (`LTXVEditAnythingModuleLoader` +
`LTXVEditAnythingLoopingSampler`), not by the standard ComfyUI LoRA loader.

### 3.1. `role_embedding`: appearance role bias

| Key | Shape | Notes |
|---|---|---|
| `role_embedding.embedding.weight` | (1, 128) | 1 slot (appearance). Padded to (3, 128) at inference; entry stored at slot 1 (ref_img role). |

Present in **both** builds with the same shape. In the
`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build it is
**frozen** (`use_visual_ref_role_embedding: false`); wandb shows its norm
stays flat at ~0.125 throughout training.
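
The padding step in the table above can be pictured as follows. This is a sketch in plain Python lists rather than the loader's code: `pad_role_embedding` is a hypothetical name, and zero-filling the unused slots is an assumption, since the source only states the padded shape and the target slot.

```python
def pad_role_embedding(stored, num_roles=3, slot=1):
    """Place a (1, D) embedding row into a (num_roles, D) table.

    Assumption: slots other than `slot` are filled with zeros.
    """
    if len(stored) != 1:
        raise ValueError("expected a single stored row")
    dim = len(stored[0])
    table = [[0.0] * dim for _ in range(num_roles)]
    table[slot] = list(stored[0])  # slot 1 is the ref_img role
    return table
```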

### 3.2. `ref_adaln_proj`: global AdaLN appearance anchor

Two-layer MLP that pools the reference latent into a vector added to every
block's AdaLN timestep bias.

| Key | `ref_adaln_proj-role_embedding` shape | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` shape |
|---|---|---|
| `ref_adaln_proj.fc1.weight` | (512, **256**) | (512, **768**) |
| `ref_adaln_proj.fc1.bias` | (512,) | (512,) |
| `ref_adaln_proj.proj.weight` | (36864, 512) | (36864, 512) |
| `ref_adaln_proj.proj.bias` | (36864,) | (36864,) |

> ⚠️ **Shape mismatch on `fc1.weight`**.
> The `ref_adaln_proj-role_embedding` build was trained with a 2-scale pool
> (`avg_1x1 ‖ max_1x1` → 256-dim input).
> The `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build was
> trained with a 3-scale pool (`avg_1x1 ‖ avg_2x2 ‖ max_1x1` → 768-dim input).
> Because of this incompatibility the trainer **reinitializes**
> `ref_adaln_proj` from scratch when continuing from
> `ref_adaln_proj-role_embedding`; the AdaLN projector in the continuation
> is **not** a fine-tune of the original one. The output dim 36864 = AdaLN
> param count for the LTX-2 transformer (read at runtime via
> `preprocessor.adaln.linear.out_features`).
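
The two `fc1` input widths follow from the pooling layouts if each pooled cell contributes one channel vector. The sketch below assumes 128 channels per cell, which is inferred from 256 / 2 and from the 128-wide terms in `ref_visual_proj`; it is not stated outright in the source, and `pooled_dim` is a hypothetical helper.

```python
def pooled_dim(channels, grids):
    """Width of the concatenated pool output.

    Each entry g in `grids` is a g x g pooling grid that is flattened,
    so it contributes g * g * channels features.
    """
    return sum(g * g * channels for g in grids)


CHANNELS = 128  # assumption: channels per pooled cell

two_scale = pooled_dim(CHANNELS, [1, 1])       # avg_1x1, max_1x1
three_scale = pooled_dim(CHANNELS, [1, 2, 1])  # avg_1x1, avg_2x2, max_1x1
```

Here the `avg_2x2` term alone accounts for 4 x 128 = 512 of the 768 input features, which is why the two `fc1.weight` tensors cannot be loaded into each other.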

### 3.3. `ref_visual_proj`: visual cross-attention memory tokens

Present in `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` only.
`SafeVisualRefProjector` (training file `video_to_video_ref_visual.py`).
Produces 32 visual memory tokens consumed by the new `ref_attn` branch.

| Key | Shape | Notes |
|---|---|---|
| `ref_visual_proj.fc1.weight` | (1024, **384**) | input 384 = 128 (local pooled) + 128 (global mean) + 128 (global std) |
| `ref_visual_proj.fc1.bias` | (1024,) | xavier init gain 0.1 |
| `ref_visual_proj.proj.weight` | (4096, 1024) | maps to context_dim 4096; xavier init gain 0.05 |
| `ref_visual_proj.proj.bias` | (4096,) | |
| `ref_visual_proj.norm.weight` | (4096,) | LayerNorm γ |
| `ref_visual_proj.norm.bias` | (4096,) | LayerNorm β |
| `ref_visual_proj.pos_embed` | (1, 32, 4096) | per-token learned positional bias |

Forward (matches `SafeVisualRefProjector.forward`):
```
tokens = local ‖ global_mean ‖ global_std          # [B, 32, 384]
tokens = proj(silu(fc1(tokens)))                   # → [B, 32, 4096]
tokens = LayerNorm(tokens)
tokens = tokens + pos_embed[:, :32]
return tokens * token_scale                        # training default 0.25
```
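
As a consistency check on the table, the shapes can be chained through the forward pass without loading any weights. The sketch assumes the weights use the PyTorch `nn.Linear` layout of `(out_features, in_features)`; `SHAPES` and `trace_output_shape` are names introduced here for illustration.

```python
SHAPES = {
    "fc1.weight": (1024, 384),
    "fc1.bias": (1024,),
    "proj.weight": (4096, 1024),
    "proj.bias": (4096,),
    "norm.weight": (4096,),
    "norm.bias": (4096,),
    "pos_embed": (1, 32, 4096),
}


def trace_output_shape(batch, shapes=SHAPES):
    """Follow [B, tokens, features] through the forward pass, checking each step."""
    _, num_tokens, width = shapes["pos_embed"]
    # Concatenation of local, global_mean and global_std, 128 features each.
    x = (batch, num_tokens, 128 * 3)
    for name in ("fc1", "proj"):
        out_f, in_f = shapes[f"{name}.weight"]  # assumed (out, in) layout
        assert x[-1] == in_f, f"{name}: expected {in_f} features, got {x[-1]}"
        assert shapes[f"{name}.bias"] == (out_f,)
        x = (batch, num_tokens, out_f)
    # LayerNorm and pos_embed both operate on the final feature width.
    assert shapes["norm.weight"] == shapes["norm.bias"] == (x[-1],)
    assert x[-1] == width
    return x
```

The SiLU, the LayerNorm and the final `token_scale` multiply do not change the shape, so they appear here only as width checks.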

Not present in `ref_adaln_proj-role_embedding`; this entire branch is new.

---

## 4. Total tensor counts (sanity check)

### `ref_adaln_proj-role_embedding`
```
LoRA: 10 modules × 48 blocks × 2 (A,B)            = 960
ref_adaln_proj: 4 (fc1.{w,b}, proj.{w,b})         =   4
role_embedding: 1                                 =   1
                                              total= 965 ✓
```

### `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`
```
LoRA: 14 modules × 48 blocks × 2 (A,B)            = 1344
ref_adaln_proj: 4                                 =    4
ref_visual_proj: 7                                =    7
role_embedding: 1                                 =    1
                                              total= 1356 ✓
```
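
The same tallies can be reproduced from a checkpoint's key list, for example the keys of a safetensors header. A minimal sketch, with `count_by_group` as a hypothetical helper that groups keys by the naming conventions described in this document:

```python
from collections import Counter


def count_by_group(keys):
    """Tally keys as 'lora' or by the name of their top-level extra."""
    counts = Counter()
    for key in keys:
        if key.endswith((".lora_A.weight", ".lora_B.weight")):
            counts["lora"] += 1
        else:
            # e.g. "ref_adaln_proj.fc1.weight" is counted under "ref_adaln_proj"
            counts[key.split(".", 1)[0]] += 1
    return counts
```

Comparing the returned counter with the breakdowns above would flag a missing block or an unexpected extra.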

---

## 5. Loading checkpoints at inference

Use `scripts/split_editanything_lora.py` to split each raw training
checkpoint into:
- `*.standard.safetensors`: LoRA on `attn1/attn2/ff` only; safe to feed to
  ComfyUI's standard LoraLoader.
- `*.module.safetensors`: everything else (`role_embedding`,
  `ref_adaln_proj`, `ref_visual_proj`, `ref_attn` LoRA adapters); feed to
  `LTXVEditAnythingModuleLoader`.

The filename suffix lists every extra that ended up in the module sidecar,
so it is obvious at a glance which mechanisms a given pair carries. Order is
fixed: `ref_adaln_proj`, `role_embedding`, `ref_attn`, `ref_visual_proj`.
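
The split rule and the suffix ordering described above can be sketched as below. This illustrates the documented behavior and is not the code of `scripts/split_editanything_lora.py`; `is_standard` and `extras_suffix` are hypothetical names.

```python
EXTRA_ORDER = ("ref_adaln_proj", "role_embedding", "ref_attn", "ref_visual_proj")
STANDARD_MODULES = ("attn1.", "attn2.", "ff.")
BLOCK_PREFIX = "diffusion_model.transformer_blocks."


def is_standard(key):
    """True for per-block LoRA keys on attn1/attn2/ff."""
    if not key.startswith(BLOCK_PREFIX):
        return False
    # Drop the prefix and the block index, leaving e.g. "attn1.to_q.lora_A.weight".
    _, _, rest = key[len(BLOCK_PREFIX):].partition(".")
    return rest.startswith(STANDARD_MODULES)


def extras_suffix(module_keys):
    """Join the extras found in the module sidecar, in the fixed order."""
    present = [
        name for name in EXTRA_ORDER
        if any(k.startswith(name + ".") or f".{name}." in k for k in module_keys)
    ]
    return "-".join(present)
```

Keys for which `is_standard` is false would go to the module sidecar, and `extras_suffix` applied to those keys gives the suffix that distinguishes the two builds.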

### Canonical output names

```
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors

edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors
```

### Command

```bash
python3 /data/training/ltx-edit-trainer/scripts/split_editanything_lora.py \
  <raw-checkpoint>.safetensors --output-dir <dir> [--overwrite]
```