# LoRA Layer Inventory — Edit Anything checkpoints

Inventory of every tensor in two builds of the
`edit_anything_reference_v0.1_r128` LoRA.

Both builds share the same canonical basename
(`edit_anything_reference_v0.1_r128`) and are distinguished by the **extras
suffix** that `scripts/split_editanything_lora.py` appends to the output
filenames:

- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.{standard,module}.safetensors`
  is the original build. Its only extras are `ref_adaln_proj` and
  `role_embedding`.
- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.{standard,module}.safetensors`
  is the continuation, fine-tuned with the
  `video_to_video_ref_visual_adaln` strategy. It adds the `ref_attn` LoRA
  branch and the `ref_visual_proj` projector on top of the original
  extras.

In the rest of this doc the two are referred to by their suffix only:
- `ref_adaln_proj-role_embedding`
- `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`

Rank is 128 in both (encoded in the LoRA tensor shapes; no `alpha` keys saved).
Dtype is `bfloat16` throughout. All LoRA modules cover **48 transformer blocks**.
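
Because no `alpha` keys are saved, the rank has to be read from the tensor shapes. The sketch below shows one way to do that from a key-to-shape mapping. It assumes the common PEFT-style layout, `lora_A` as `(rank, in_features)` and `lora_B` as `(out_features, rank)`; the helper name `infer_lora_ranks` and the 4096 feature width in the example are illustrative, not taken from the checkpoints.

```python
from typing import Dict, Set, Tuple


def infer_lora_ranks(shapes: Dict[str, Tuple[int, ...]]) -> Set[int]:
    """Collect the distinct ranks implied by lora_A / lora_B tensor shapes."""
    ranks = set()
    for key, shape in shapes.items():
        if key.endswith(".lora_A.weight"):
            ranks.add(shape[0])  # assumed layout: (rank, in_features)
        elif key.endswith(".lora_B.weight"):
            ranks.add(shape[1])  # assumed layout: (out_features, rank)
    return ranks


# Hypothetical shapes for one adapter pair; 4096 is an illustrative width.
example = {
    "diffusion_model.transformer_blocks.0.attn1.to_q.lora_A.weight": (128, 4096),
    "diffusion_model.transformer_blocks.0.attn1.to_q.lora_B.weight": (4096, 128),
}
ranks = infer_lora_ranks(example)
```

A single-element result means every adapter agrees on the rank. How a loader scales the update when `alpha` is absent is loader-specific and is not covered by this sketch.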

---

## 1. Summary

| | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` |
|---|---|---|
| Total tensors | 965 | 1356 |
| LoRA-target modules | **10** | **14** |
| LoRA tensors (A+B) | 960 | 1344 |
| Extra (non-LoRA) tensors | 5 | 12 |
| `ref_attn` LoRA branch | ❌ absent | ✅ trained on 48 blocks |
| `ref_visual_proj` (visual cross-attn projector) | ❌ absent | ✅ present (7 tensors) |
| `ref_adaln_proj` (global appearance AdaLN) | ✅ (fc1 input dim **256**) | ✅ (fc1 input dim **768**) |
| `role_embedding` | ✅ shape (1, 128) | ✅ shape (1, 128) |

---

## 2. LoRA adapters

Each row is one target module type. A ✅ means a (`lora_A.weight`, `lora_B.weight`)
pair exists for that module in each of the 48 blocks of `diffusion_model.transformer_blocks.*`.

| Module | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` | Notes |
|---|:---:|:---:|---|
| `attn1.to_q` | ✅ | ✅ | self-attention query |
| `attn1.to_k` | ✅ | ✅ | self-attention key |
| `attn1.to_v` | ✅ | ✅ | self-attention value |
| `attn1.to_out.0` | ✅ | ✅ | self-attention output proj |
| `attn2.to_q` | ✅ | ✅ | cross-attention to text (Gemma), query |
| `attn2.to_k` | ✅ | ✅ | cross-attention key |
| `attn2.to_v` | ✅ | ✅ | cross-attention value |
| `attn2.to_out.0` | ✅ | ✅ | cross-attention output proj |
| `ff.net.0.proj` | ✅ | ✅ | feed-forward up-projection |
| `ff.net.2` | ✅ | ✅ | feed-forward down-projection |
| `ref_attn.to_q` | — | ✅ | **new** — visual reference cross-attention |
| `ref_attn.to_k` | — | ✅ | **new** |
| `ref_attn.to_v` | — | ✅ | **new** |
| `ref_attn.to_out.0` | — | ✅ | **new** |

**Key naming**: `diffusion_model.transformer_blocks.{0..47}.{module}.{lora_A|lora_B}.weight`
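
The key pattern can be expanded mechanically. The following sketch enumerates the LoRA keys each build is expected to contain, using the module names from the table above; the helper name `expected_lora_keys` is illustrative.

```python
BASE_MODULES = [
    "attn1.to_q", "attn1.to_k", "attn1.to_v", "attn1.to_out.0",
    "attn2.to_q", "attn2.to_k", "attn2.to_v", "attn2.to_out.0",
    "ff.net.0.proj", "ff.net.2",
]
REF_ATTN_MODULES = [
    "ref_attn.to_q", "ref_attn.to_k", "ref_attn.to_v", "ref_attn.to_out.0",
]
NUM_BLOCKS = 48


def expected_lora_keys(modules, num_blocks=NUM_BLOCKS):
    """Expand the key pattern for every block, module, and A/B matrix."""
    return [
        f"diffusion_model.transformer_blocks.{block}.{module}.{part}.weight"
        for block in range(num_blocks)
        for module in modules
        for part in ("lora_A", "lora_B")
    ]


original_keys = expected_lora_keys(BASE_MODULES)
continuation_keys = expected_lora_keys(BASE_MODULES + REF_ATTN_MODULES)
```

Comparing these lists against the keys of a real file is a quick way to spot a missing block or module.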

**Training freeze policy** for the
`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build
(per `stage2_ref_visual_adaln_crossattn_from_v01_r128.yaml`):
- `attn1.*` adapters are loaded from the `ref_adaln_proj-role_embedding` build
  and kept **frozen** (`trainable_include_patterns` excludes them).
- `attn2.*`, `ff.*`, `ref_attn.*` are trainable.
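
A minimal sketch of how an include-pattern freeze policy like this can be applied to LoRA parameter names. It assumes glob-style patterns; the pattern values and the `is_trainable` helper are hypothetical, and the actual syntax expected by `trainable_include_patterns` in the YAML config may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns mirroring the policy above; not copied from the YAML.
TRAINABLE_INCLUDE_PATTERNS = ["*.attn2.*", "*.ff.*", "*.ref_attn.*"]


def is_trainable(param_name, patterns=TRAINABLE_INCLUDE_PATTERNS):
    """A parameter trains only if its name matches at least one include pattern."""
    return any(fnmatchcase(param_name, pattern) for pattern in patterns)


names = [
    "diffusion_model.transformer_blocks.0.attn1.to_q.lora_A.weight",
    "diffusion_model.transformer_blocks.0.attn2.to_q.lora_A.weight",
    "diffusion_model.transformer_blocks.3.ff.net.2.lora_B.weight",
    "diffusion_model.transformer_blocks.7.ref_attn.to_k.lora_A.weight",
]
trainable = {name: is_trainable(name) for name in names}
```

In a PyTorch training loop the result would typically drive `param.requires_grad` for each named parameter. The sketch covers only the LoRA adapters, not the extras in section 3.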

---

## 3. Non-LoRA modules (the module sidecar)

These tensors live at the top level of the state dict (no `transformer_blocks.*` prefix).
They ship in the module sidecar, next to the `ref_attn` LoRA adapters (see section 5),
and are consumed by the custom inference path (`LTXVEditAnythingModuleLoader` +
`LTXVEditAnythingLoopingSampler`), not by the standard ComfyUI LoRA loader.

### 3.1. `role_embedding` — appearance role bias

| Key | Shape | Notes |
|---|---|---|
| `role_embedding.embedding.weight` | (1, 128) | 1 slot (appearance). Padded to (3, 128) at inference; entry stored at slot 1 (ref_img role). |

Present in **both** builds with the same shape. In the
`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build it is
**frozen** (`use_visual_ref_role_embedding: false`); wandb shows its norm
stays flat at ~0.125 throughout training.
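
The inference-time padding noted in the table can be pictured as follows. This is a sketch only: it assumes the unused slots are zero-filled, which the inventory does not state, and `pad_role_embedding` is a hypothetical helper, not a function from the inference code.

```python
def pad_role_embedding(weight, num_roles=3, slot=1):
    """Place a stored (1, dim) row into a (num_roles, dim) table at `slot`.

    Assumption: all other slots are filled with zeros.
    """
    if len(weight) != 1:
        raise ValueError("expected exactly one stored embedding row")
    dim = len(weight[0])
    padded = [[0.0] * dim for _ in range(num_roles)]
    padded[slot] = list(weight[0])
    return padded


stored = [[0.5] * 128]            # stand-in for the (1, 128) tensor
padded = pad_role_embedding(stored)  # (3, 128), entry at slot 1 (ref_img role)
```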

### 3.2. `ref_adaln_proj` — global AdaLN appearance anchor

Two-layer MLP that maps the pooled reference latent to a vector added to every
block's AdaLN timestep bias.

| Key | `ref_adaln_proj-role_embedding` shape | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` shape |
|---|---|---|
| `ref_adaln_proj.fc1.weight` | (512, **256**) | (512, **768**) |
| `ref_adaln_proj.fc1.bias` | (512,) | (512,) |
| `ref_adaln_proj.proj.weight` | (36864, 512) | (36864, 512) |
| `ref_adaln_proj.proj.bias` | (36864,) | (36864,) |

> ⚠️ **Shape mismatch on `fc1.weight`**.
> The `ref_adaln_proj-role_embedding` build was trained with a 2-scale pool
> (`avg_1x1 ‖ max_1x1` → 256-dim input).
> The `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build was
> trained with a 3-scale pool (`avg_1x1 ‖ avg_2x2 ‖ max_1x1` → 768-dim).
> Because of this incompatibility the trainer **reinitializes**
> `ref_adaln_proj` from scratch when continuing from
> `ref_adaln_proj-role_embedding`; the AdaLN projector in the continuation
> is **not** a fine-tune of the original one. The output dim 36864 = AdaLN
> param count for the LTX-2 transformer (read at runtime via
> `preprocessor.adaln.linear.out_features`).
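
The two input widths follow from the pool layouts if the reference latent has 128 channels. That channel count is an assumption here, chosen because it is consistent with both 256 and 768. The sketch below works through the arithmetic and shows the kind of shape check that forces a reinitialization; both helper names are illustrative.

```python
CHANNELS = 128  # assumption: reference latent channel count

# Each pool is (kind, out_height, out_width); its flattened size is C * H * W.
ORIGINAL_POOLS = [("avg", 1, 1), ("max", 1, 1)]
CONTINUATION_POOLS = [("avg", 1, 1), ("avg", 2, 2), ("max", 1, 1)]


def pooled_feature_dim(channels, pools):
    """Width of the concatenated, flattened pooling outputs."""
    return sum(channels * height * width for _kind, height, width in pools)


def can_reuse_fc1(saved_shape, expected_in_features, hidden=512):
    """True only if a saved fc1.weight fits the current pooling layout."""
    return tuple(saved_shape) == (hidden, expected_in_features)


original_dim = pooled_feature_dim(CHANNELS, ORIGINAL_POOLS)
continuation_dim = pooled_feature_dim(CHANNELS, CONTINUATION_POOLS)
reusable = can_reuse_fc1((512, original_dim), continuation_dim)
```

Under this assumption the `avg_2x2` pool alone contributes 4 × 128 = 512 of the 768 input features.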

### 3.3. `ref_visual_proj` — visual cross-attention memory tokens

Present in `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` only.
Implemented by `SafeVisualRefProjector` (training file `video_to_video_ref_visual.py`),
it produces 32 visual memory tokens consumed by the new `ref_attn` branch.

| Key | Shape | Notes |
|---|---|---|
| `ref_visual_proj.fc1.weight` | (1024, **384**) | input 384 = 128 (local pooled) + 128 (global mean) + 128 (global std); xavier init gain 0.1 |
| `ref_visual_proj.fc1.bias` | (1024,) | |
| `ref_visual_proj.proj.weight` | (4096, 1024) | maps to context_dim 4096; xavier init gain 0.05 |
| `ref_visual_proj.proj.bias` | (4096,) | |
| `ref_visual_proj.norm.weight` | (4096,) | LayerNorm γ |
| `ref_visual_proj.norm.bias` | (4096,) | LayerNorm β |
| `ref_visual_proj.pos_embed` | (1, 32, 4096) | per-token learned positional bias |

Forward (matches `SafeVisualRefProjector.forward`):
```
tokens = local ‖ global_mean ‖ global_std          # [B, 32, 384]
tokens = proj(silu(fc1(tokens)))                   # → [B, 32, 4096]
tokens = LayerNorm(tokens)
tokens = tokens + pos_embed[:, :32]
return tokens * token_scale                        # training default 0.25
```

Not present in `ref_adaln_proj-role_embedding` — this entire branch is new.

---

## 4. Total tensor counts (sanity check)

### `ref_adaln_proj-role_embedding`
```
LoRA: 10 modules × 48 blocks × 2 (A,B)            = 960
ref_adaln_proj: 4 (fc1.{w,b}, proj.{w,b})         =   4
role_embedding: 1                                 =   1
                                              total= 965 ✓
```

### `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`
```
LoRA: 14 modules × 48 blocks × 2 (A,B)            = 1344
ref_adaln_proj: 4                                 =    4
ref_visual_proj: 7                                =    7
role_embedding: 1                                 =    1
                                              total= 1356 ✓
```
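
The same tally can be computed from a list of keys rather than by hand. The sketch below classifies keys by suffix and groups the extras by their top-level name. `tally_tensors` is an illustrative helper, and the LoRA keys in the example use placeholder module names to keep it short.

```python
from collections import Counter

LORA_SUFFIXES = (".lora_A.weight", ".lora_B.weight")


def tally_tensors(keys):
    """Count LoRA tensors and group the remaining keys by top-level module."""
    keys = list(keys)
    lora_count = sum(1 for key in keys if key.endswith(LORA_SUFFIXES))
    extras = Counter(
        key.split(".", 1)[0] for key in keys if not key.endswith(LORA_SUFFIXES)
    )
    return {"total": len(keys), "lora": lora_count, "extras": dict(extras)}


EXTRA_KEYS = [
    "role_embedding.embedding.weight",
    "ref_adaln_proj.fc1.weight", "ref_adaln_proj.fc1.bias",
    "ref_adaln_proj.proj.weight", "ref_adaln_proj.proj.bias",
]
# Stand-in LoRA keys: 10 placeholder module names x 48 blocks x (A, B).
lora_keys = [
    f"diffusion_model.transformer_blocks.{block}.module_{m}.{part}.weight"
    for block in range(48)
    for m in range(10)
    for part in ("lora_A", "lora_B")
]
report = tally_tensors(lora_keys + EXTRA_KEYS)
```

For a real file, the key list can be read with `safetensors.safe_open(path, framework="pt")` and its `keys()` method, then passed to the same function.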

---

## 5. Loading a checkpoint at inference

Use `scripts/split_editanything_lora.py` to split each raw training
checkpoint into:
- `*.standard.safetensors` — LoRA on `attn1/attn2/ff` only; safe to feed to
  ComfyUI's standard LoraLoader.
- `*.module.safetensors` — everything else (`role_embedding`,
  `ref_adaln_proj`, `ref_visual_proj`, `ref_attn` LoRA adapters); feed to
  `LTXVEditAnythingModuleLoader`.

The filename suffix lists every extra that ended up in the module sidecar,
so it is obvious at a glance which mechanisms a given pair carries. Order is
fixed: `ref_adaln_proj`, `role_embedding`, `ref_attn`, `ref_visual_proj`.
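
The routing and naming rules above can be sketched as below. This is an illustration of the described behavior, not the logic of `split_editanything_lora.py`, which is not reproduced here; `route_key` and `extras_suffix` are hypothetical helpers.

```python
BLOCK_PREFIX = "diffusion_model.transformer_blocks."
STANDARD_PREFIXES = ("attn1.", "attn2.", "ff.")
EXTRAS_ORDER = ("ref_adaln_proj", "role_embedding", "ref_attn", "ref_visual_proj")


def route_key(key):
    """Return ("standard", None) or ("module", extra_name) for one key."""
    if key.startswith(BLOCK_PREFIX):
        # Drop the prefix and block index, e.g. "attn1.to_q.lora_A.weight".
        inner = key[len(BLOCK_PREFIX):].split(".", 1)[1]
        if inner.startswith(STANDARD_PREFIXES):
            return "standard", None
        return "module", inner.split(".", 1)[0]
    return "module", key.split(".", 1)[0]


def extras_suffix(keys):
    """Join the extras found in the module sidecar, in the fixed order."""
    present = {extra for dest, extra in map(route_key, keys) if dest == "module"}
    # Names outside EXTRAS_ORDER are ignored in this sketch.
    return "-".join(name for name in EXTRAS_ORDER if name in present)


keys = [
    "diffusion_model.transformer_blocks.0.attn1.to_q.lora_A.weight",
    "diffusion_model.transformer_blocks.0.ref_attn.to_q.lora_A.weight",
    "ref_visual_proj.pos_embed",
    "role_embedding.embedding.weight",
    "ref_adaln_proj.fc1.bias",
]
suffix = extras_suffix(keys)
```

Deriving the suffix from the keys themselves keeps the filename in step with what the sidecar actually contains.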

### Canonical output names

```
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors

edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors
```

### Command

```bash
python3 /data/training/ltx-edit-trainer/scripts/split_editanything_lora.py \
  <raw-checkpoint>.safetensors --output-dir <dir> [--overwrite]
```