# LoRA Layer Inventory: Edit Anything checkpoints
Inventory of every tensor in two builds of the
`edit_anything_reference_v0.1_r128` LoRA.
Both builds share the same canonical basename
(`edit_anything_reference_v0.1_r128`) and are distinguished by the **extras
suffix** that `scripts/split_editanything_lora.py` appends to the output
filenames:
- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.{standard,module}.safetensors`:
  the original build. It ships only `ref_adaln_proj` and `role_embedding`.
- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.{standard,module}.safetensors`:
  the continuation, fine-tuned with the
  `video_to_video_ref_visual_adaln` strategy. It adds the `ref_attn` LoRA
  branch and the `ref_visual_proj` projector on top of the original
  extras.
In the rest of this doc the two are referred to by their suffix only:
- `ref_adaln_proj-role_embedding`
- `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`
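
Because the suffix is a hyphen-joined list of extras, it can be recovered from a filename mechanically. The following is a minimal sketch, not code from the repository: `parse_build_filename` and `KNOWN_EXTRAS` are hypothetical names introduced here, and the pattern assumes the basename and the `.{standard,module}.safetensors` ending shown above.

```python
import re

# Hypothetical helper for illustration; not part of split_editanything_lora.py.
BASENAME = "edit_anything_reference_v0.1_r128"
KNOWN_EXTRAS = ("ref_adaln_proj", "role_embedding", "ref_attn", "ref_visual_proj")

_FILENAME_RE = re.compile(
    r"^" + re.escape(BASENAME)
    + r"_(?P<suffix>.+)\.(?P<kind>standard|module)\.safetensors$"
)


def parse_build_filename(filename):
    """Return (extras, kind) for a split checkpoint filename.

    Extras names contain underscores but no hyphens, so splitting the
    suffix on '-' is unambiguous.
    """
    match = _FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized filename: {filename!r}")
    extras = match.group("suffix").split("-")
    unknown = [name for name in extras if name not in KNOWN_EXTRAS]
    if unknown:
        raise ValueError(f"unknown extras in suffix: {unknown}")
    return extras, match.group("kind")


extras, kind = parse_build_filename(
    "edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors"
)
print(extras, kind)
```

A build carries the `ref_attn` branch exactly when `"ref_attn"` appears in the returned list.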
Rank is 128 in both builds (encoded in the LoRA tensor shapes; no `alpha` keys are saved).
Dtype is `bfloat16` throughout. Every LoRA target module is adapted in all **48 transformer blocks**.
---
## 1. Summary
| | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` |
|---|---|---|
| Total tensors | 965 | 1356 |
| LoRA-target modules | **10** | **14** |
| LoRA tensors (A+B) | 960 | 1344 |
| Extra (non-LoRA) tensors | 5 | 12 |
| `ref_attn` LoRA branch | ❌ absent | ✅ trained on 48 blocks |
| `ref_visual_proj` (visual cross-attn projector) | ❌ absent | ✅ present (7 tensors) |
| `ref_adaln_proj` (global appearance AdaLN) | ✅ (fc1 input dim **256**) | ✅ (fc1 input dim **768**) |
| `role_embedding` | ✅ shape (1, 128) | ✅ shape (1, 128) |
---
## 2. LoRA adapters
Each row is one target module type. Where a module is present in a build, it contributes a
(`lora_A.weight`, `lora_B.weight`) pair in each of the 48 blocks under `diffusion_model.transformer_blocks.*`.
| Module | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` | Notes |
|---|:---:|:---:|---|
| `attn1.to_q` | ✅ | ✅ | self-attention query |
| `attn1.to_k` | ✅ | ✅ | self-attention key |
| `attn1.to_v` | ✅ | ✅ | self-attention value |
| `attn1.to_out.0` | ✅ | ✅ | self-attention output proj |
| `attn2.to_q` | ✅ | ✅ | cross-attention to text (Gemma) |
| `attn2.to_k` | ✅ | ✅ | |
| `attn2.to_v` | ✅ | ✅ | |
| `attn2.to_out.0` | ✅ | ✅ | |
| `ff.net.0.proj` | ✅ | ✅ | feed-forward up-projection |
| `ff.net.2` | ✅ | ✅ | feed-forward down-projection |
| `ref_attn.to_q` | ❌ | ✅ | **new**: visual reference cross-attention |
| `ref_attn.to_k` | ❌ | ✅ | **new** |
| `ref_attn.to_v` | ❌ | ✅ | **new** |
| `ref_attn.to_out.0` | ❌ | ✅ | **new** |

**Key naming**: `diffusion_model.transformer_blocks.{0..47}.{module}.{lora_A|lora_B}.weight`

**Training freeze policy** for the
`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build
(per `stage2_ref_visual_adaln_crossattn_from_v01_r128.yaml`):
- `attn1.*` adapters are loaded from the `ref_adaln_proj-role_embedding` build
  but kept **frozen** (`trainable_include_patterns` excludes them).
- `attn2.*`, `ff.*`, and `ref_attn.*` adapters are trainable.
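
The key naming pattern and the freeze policy above can be expressed as a small parser. This is an illustrative sketch: `parse_lora_key` and `is_trainable_in_continuation` are hypothetical helpers, and the prefix check only mirrors the two bullets above rather than the actual `trainable_include_patterns` syntax in the YAML, which is not shown in this document.

```python
import re

# Matches diffusion_model.transformer_blocks.{N}.{module}.{lora_A|lora_B}.weight
KEY_RE = re.compile(
    r"^diffusion_model\.transformer_blocks\.(?P<block>\d+)\."
    r"(?P<module>.+)\.(?P<part>lora_A|lora_B)\.weight$"
)

# Assumption: frozen modules are identified by prefix, as in the bullets above.
FROZEN_PREFIXES = ("attn1.",)


def parse_lora_key(key):
    """Return (block_index, module, part), or None for non-LoRA keys."""
    match = KEY_RE.match(key)
    if match is None:
        return None
    return int(match.group("block")), match.group("module"), match.group("part")


def is_trainable_in_continuation(module):
    """True for attn2.*, ff.* and ref_attn.*; False for the frozen attn1.*."""
    return not module.startswith(FROZEN_PREFIXES)


block, module, part = parse_lora_key(
    "diffusion_model.transformer_blocks.47.attn1.to_out.0.lora_B.weight"
)
print(block, module, part, is_trainable_in_continuation(module))
```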
---
## 3. Non-LoRA modules (the module sidecar)
These tensors live at the top level of the state dict (no `transformer_blocks.*` prefix)
and are consumed by the custom inference path (`LTXVEditAnythingModuleLoader` +
`LTXVEditAnythingLoopingSampler`), not by the standard ComfyUI LoRA loader.
### 3.1. `role_embedding`: appearance role bias
| Key | Shape | Notes |
|---|---|---|
| `role_embedding.embedding.weight` | (1, 128) | 1 slot (appearance). Padded to (3, 128) at inference; entry stored at slot 1 (ref_img role). |
Present in **both** builds with the same shape. In the
`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build it is
**frozen** (`use_visual_ref_role_embedding: false`); wandb shows its norm
stays flat at ~0.125 throughout training.
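
The padding step described in the table can be sketched with plain lists. This is only an illustration under a stated assumption: the source gives the padded shape (3, 128) and the slot index 1, but not the fill value for the other slots, so zeros are assumed here. `pad_role_embedding` is a hypothetical name.

```python
def pad_role_embedding(stored, num_roles=3, slot=1, fill=0.0):
    """Place the single stored row at `slot` of a (num_roles, dim) table.

    ASSUMPTION: unused slots are filled with `fill` (zeros by default).
    The source only states the padded shape and the slot index.
    """
    if len(stored) != 1:
        raise ValueError(f"expected exactly 1 stored row, got {len(stored)}")
    dim = len(stored[0])
    padded = [[fill] * dim for _ in range(num_roles)]
    padded[slot] = list(stored[0])
    return padded


# Stand-in for role_embedding.embedding.weight, shape (1, 128).
stored = [[0.01] * 128]
padded = pad_role_embedding(stored)
print(len(padded), len(padded[0]))  # 3 128
```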
### 3.2. `ref_adaln_proj`: global AdaLN appearance anchor
Two-layer MLP that maps the pooled reference latent to a vector added to every
block's AdaLN timestep bias.
| Key | `ref_adaln_proj-role_embedding` shape | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` shape |
|---|---|---|
| `ref_adaln_proj.fc1.weight` | (512, **256**) | (512, **768**) |
| `ref_adaln_proj.fc1.bias` | (512,) | (512,) |
| `ref_adaln_proj.proj.weight` | (36864, 512) | (36864, 512) |
| `ref_adaln_proj.proj.bias` | (36864,) | (36864,) |
> ⚠️ **Shape mismatch on `fc1.weight`**.
> The `ref_adaln_proj-role_embedding` build was trained with a 2-scale pool
> (`avg_1x1 ‖ max_1x1` → 256-dim input).
> The `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build was
> trained with a 3-scale pool (`avg_1x1 ‖ avg_2x2 ‖ max_1x1` → 768-dim).
> Because of this incompatibility the trainer **reinitializes**
> `ref_adaln_proj` from scratch when continuing from
> `ref_adaln_proj-role_embedding`; the AdaLN projector in the continuation
> is **not** a fine-tune of the original one. The output dim 36864 = AdaLN
> param count for the LTX-2 transformer (read at runtime via
> `preprocessor.adaln.linear.out_features`).
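
The two `fc1` input widths follow from the pooling layouts if each pooled cell contributes one channel vector. The arithmetic below is a sketch under that assumption: the channel width of 128 is inferred from the other shapes in this document (for example the 128-dim pieces of `ref_visual_proj`), and `pooled_dim` is a hypothetical helper, not trainer code.

```python
CHANNELS = 128  # assumed channel width of the reference latent

# Spatial cells produced by each pooling op: a 1x1 pool yields 1 cell,
# a 2x2 pool yields 4 cells.
POOL_CELLS = {"avg_1x1": 1, "avg_2x2": 4, "max_1x1": 1}


def pooled_dim(pools, channels=CHANNELS):
    """Width of the concatenated pooled vector fed to fc1."""
    return sum(POOL_CELLS[name] * channels for name in pools)


print(pooled_dim(["avg_1x1", "max_1x1"]))             # 256
print(pooled_dim(["avg_1x1", "avg_2x2", "max_1x1"]))  # 768
```

Under this reading the extra 512 input columns come entirely from the `avg_2x2` term, which is why the original `fc1.weight` cannot be reused.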
### 3.3. `ref_visual_proj`: visual cross-attention memory tokens
Present in `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` only.
Implemented by `SafeVisualRefProjector` (training file `video_to_video_ref_visual.py`),
it produces 32 visual memory tokens that the new `ref_attn` branch consumes.
| Key | Shape | Notes |
|---|---|---|
| `ref_visual_proj.fc1.weight` | (1024, **384**) | input 384 = 128 (local pooled) + 128 (global mean) + 128 (global std) |
| `ref_visual_proj.fc1.bias` | (1024,) | xavier init gain 0.1 |
| `ref_visual_proj.proj.weight` | (4096, 1024) | maps to context_dim 4096; xavier init gain 0.05 |
| `ref_visual_proj.proj.bias` | (4096,) | |
| `ref_visual_proj.norm.weight` | (4096,) | LayerNorm γ |
| `ref_visual_proj.norm.bias` | (4096,) | LayerNorm β |
| `ref_visual_proj.pos_embed` | (1, 32, 4096) | per-token learned positional bias |
Forward (matches `SafeVisualRefProjector.forward`):
```
tokens = local ‖ global_mean ‖ global_std   # [B, 32, 384]
tokens = proj(silu(fc1(tokens)))            # → [B, 32, 4096]
tokens = LayerNorm(tokens)
tokens = tokens + pos_embed[:, :32]
return tokens * token_scale # training default 0.25
```
Not present in `ref_adaln_proj-role_embedding`: this entire branch is new.
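
The shapes in the table have to chain together for the forward pass to run: `fc1` input equals three 128-dim pieces, `fc1` output equals `proj` input, and `proj` output matches the LayerNorm and `pos_embed` widths. A loader could verify this before use. The sketch below is a hypothetical check (`check_visual_proj_shapes` is not a function from the repository) that works on a plain mapping of key to shape tuple.

```python
# Shapes copied from the table above.
VISUAL_PROJ_SHAPES = {
    "ref_visual_proj.fc1.weight": (1024, 384),
    "ref_visual_proj.fc1.bias": (1024,),
    "ref_visual_proj.proj.weight": (4096, 1024),
    "ref_visual_proj.proj.bias": (4096,),
    "ref_visual_proj.norm.weight": (4096,),
    "ref_visual_proj.norm.bias": (4096,),
    "ref_visual_proj.pos_embed": (1, 32, 4096),
}


def check_visual_proj_shapes(shapes, feature_dim=128, num_tokens=32):
    """Return a list of human-readable problems; empty means consistent."""
    prefix = "ref_visual_proj."
    hidden, in_dim = shapes[prefix + "fc1.weight"]
    out_dim, proj_in = shapes[prefix + "proj.weight"]
    problems = []
    # Input is local + global mean + global std, each feature_dim wide.
    if in_dim != 3 * feature_dim:
        problems.append(f"fc1 input {in_dim} != {3 * feature_dim}")
    if shapes[prefix + "fc1.bias"] != (hidden,):
        problems.append("fc1.bias does not match fc1 output width")
    if proj_in != hidden:
        problems.append(f"proj input {proj_in} != fc1 output {hidden}")
    for name in ("proj.bias", "norm.weight", "norm.bias"):
        if shapes[prefix + name] != (out_dim,):
            problems.append(f"{name} does not match proj output width")
    if shapes[prefix + "pos_embed"] != (1, num_tokens, out_dim):
        problems.append("pos_embed is not (1, num_tokens, out_dim)")
    return problems


print(check_visual_proj_shapes(VISUAL_PROJ_SHAPES))  # []
```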
---
## 4. Total tensor counts (sanity check)
### `ref_adaln_proj-role_embedding`
```
LoRA:            10 modules × 48 blocks × 2 (A,B) = 960
ref_adaln_proj:  4 (fc1.{w,b}, proj.{w,b})        = 4
role_embedding:  1                                = 1
                                            total = 965 ✓
```
### `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`
```
LoRA:            14 modules × 48 blocks × 2 (A,B) = 1344
ref_adaln_proj:  4                                = 4
ref_visual_proj: 7                                = 7
role_embedding:  1                                = 1
                                            total = 1356 ✓
```
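
The same totals can be reproduced by generating the key names from the tables in sections 2 and 3 and counting them. This is a sketch on synthetic key lists, not a read of the actual files; `lora_keys` and `summarize` are hypothetical helpers.

```python
BLOCKS = 48
BASE_MODULES = [
    "attn1.to_q", "attn1.to_k", "attn1.to_v", "attn1.to_out.0",
    "attn2.to_q", "attn2.to_k", "attn2.to_v", "attn2.to_out.0",
    "ff.net.0.proj", "ff.net.2",
]
REF_ATTN_MODULES = [
    "ref_attn.to_q", "ref_attn.to_k", "ref_attn.to_v", "ref_attn.to_out.0",
]
ADALN_KEYS = [
    "ref_adaln_proj.fc1.weight", "ref_adaln_proj.fc1.bias",
    "ref_adaln_proj.proj.weight", "ref_adaln_proj.proj.bias",
]
VISUAL_KEYS = [
    "ref_visual_proj.fc1.weight", "ref_visual_proj.fc1.bias",
    "ref_visual_proj.proj.weight", "ref_visual_proj.proj.bias",
    "ref_visual_proj.norm.weight", "ref_visual_proj.norm.bias",
    "ref_visual_proj.pos_embed",
]
ROLE_KEYS = ["role_embedding.embedding.weight"]


def lora_keys(modules, blocks=BLOCKS):
    """One lora_A and one lora_B key per module per block."""
    return [
        f"diffusion_model.transformer_blocks.{b}.{m}.{part}.weight"
        for b in range(blocks)
        for m in modules
        for part in ("lora_A", "lora_B")
    ]


def summarize(keys):
    """Split a key list into LoRA and extra (non-LoRA) tensor counts."""
    lora = sum(1 for k in keys if ".lora_A." in k or ".lora_B." in k)
    return {"total": len(keys), "lora": lora, "extra": len(keys) - lora}


original = lora_keys(BASE_MODULES) + ADALN_KEYS + ROLE_KEYS
continuation = (
    lora_keys(BASE_MODULES + REF_ATTN_MODULES)
    + ADALN_KEYS + VISUAL_KEYS + ROLE_KEYS
)
print(summarize(original))      # {'total': 965, 'lora': 960, 'extra': 5}
print(summarize(continuation))  # {'total': 1356, 'lora': 1344, 'extra': 12}
```

With a real checkpoint, the key list could instead be taken from the file (for example via the `safetensors` library's `safe_open(path, framework="pt").keys()`) and passed to `summarize` unchanged.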
---
## 5. Loading checkpoints at inference
Use `scripts/split_editanything_lora.py` to split each raw training
checkpoint into:
- `*.standard.safetensors`: LoRA on `attn1/attn2/ff` only; safe to feed to
  ComfyUI's standard LoraLoader.
- `*.module.safetensors`: everything else (`role_embedding`,
  `ref_adaln_proj`, `ref_visual_proj`, `ref_attn` LoRA adapters); feed to
  `LTXVEditAnythingModuleLoader`.
The filename suffix lists every extra that ended up in the module sidecar,
so it is obvious at a glance which mechanisms a given pair carries. Order is
fixed: `ref_adaln_proj`, `role_embedding`, `ref_attn`, `ref_visual_proj`.
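
The routing rule and the suffix ordering described above can be sketched as follows. This is not the implementation of `scripts/split_editanything_lora.py`, whose internals are not shown here; `route_key` and `extras_suffix` are hypothetical names, and the prefix-based matching is an assumption derived from the two bullets.

```python
BLOCK_PREFIX = "diffusion_model.transformer_blocks."
STANDARD_PREFIXES = ("attn1.", "attn2.", "ff.")
# Fixed order used in the filename suffix.
EXTRA_ORDER = ("ref_adaln_proj", "role_embedding", "ref_attn", "ref_visual_proj")


def route_key(key):
    """Return 'standard' for attn1/attn2/ff LoRA keys, 'module' otherwise."""
    if key.startswith(BLOCK_PREFIX):
        rest = key[len(BLOCK_PREFIX):]
        _, _, module_path = rest.partition(".")  # drop the block index
        if module_path.startswith(STANDARD_PREFIXES):
            return "standard"
    return "module"


def extras_suffix(keys):
    """Join the extras found in the module sidecar, in the fixed order."""
    module_keys = [k for k in keys if route_key(k) == "module"]
    present = [
        name for name in EXTRA_ORDER
        if any(f"{name}." in k for k in module_keys)
    ]
    return "-".join(present)


keys = [
    "ref_visual_proj.pos_embed",
    "diffusion_model.transformer_blocks.0.ref_attn.to_q.lora_A.weight",
    "role_embedding.embedding.weight",
    "diffusion_model.transformer_blocks.0.attn1.to_q.lora_A.weight",
    "ref_adaln_proj.fc1.weight",
]
print(extras_suffix(keys))
```

Because the suffix is built by iterating over `EXTRA_ORDER` rather than over the keys, the result does not depend on the order in which tensors appear in the checkpoint.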
### Canonical output names
```
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors
```
### Command
```bash
python3 /data/training/ltx-edit-trainer/scripts/split_editanything_lora.py \
<raw-checkpoint>.safetensors --output-dir <dir> [--overwrite]
```