Upload folder using huggingface_hub
- edit_anything_30k_v1.1_motion_transfer_r128.safetensors +3 -0
- edit_anything_30k_v1.1_motion_transfer_r256.safetensors +3 -0
- edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors +3 -0
- edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors +3 -0
- edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors +3 -0
- edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors +3 -0
- lora_layers_impact.md +284 -0
- lora_layers_reference.md +196 -0
edit_anything_30k_v1.1_motion_transfer_r128.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a5d01c404594cb12e69926a9ae066d01bd1115abd345e09254c391040b226471
+size 1308816336
edit_anything_30k_v1.1_motion_transfer_r256.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:407e9eed49bd5df627d68ed5eb4cfddc0353e8d133e65ad23670b4439c5faef0
+size 2617440424
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:63ffdeed38c191108229ec3085386ac10174a0730427f86ef2c20dec4c6ea663
+size 450782608
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0e2e51d9eafd6636c9e752300578447344925b05bb5254a405302d3a6f9c668d
+size 1308756368
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6f9d4483480f9766528553e9f5e61f6683d315da8c037ff23ac5e825908fed7c
+size 38086368
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:11b69d939077ad48de24f3fbd02c7ecdfdf7db029c9dc694167e7063c61f650e
+size 1308756368
lora_layers_impact.md
ADDED
@@ -0,0 +1,284 @@
# Functional differences between the two builds and what each layer does

Companion to `lora_layers_reference.md`. That file is the inventory; this one
explains the **functional role** of every group of tensors and the **expected
behavioral impact** of toggling each branch at inference.

Two builds of the `edit_anything_reference_v0.1_r128` LoRA exist, each
delivered as a `(.standard, .module)` pair. The pairs are distinguished by
their **extras suffix**:

- `..._ref_adaln_proj-role_embedding.{standard,module}.safetensors`: the
  original build. One mechanism for steering the model toward the reference
  image: **global AdaLN appearance anchoring** (plus the IC-LoRA-style ref
  tokens packed into the sequence).
- `..._ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.{standard,module}.safetensors`:
  the continuation. Keeps everything from the original build and adds
  **two new mechanisms** that operate on different time/space scales.

In the rest of this doc the two builds are referred to by their suffix only:
- `ref_adaln_proj-role_embedding`
- `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`

---

## TL;DR

| Branch | Where it acts | What it controls | New in `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`? |
|---|---|---|---|
| `attn1` LoRA | self-attention inside every block | scene cohesion, structural editing | no (carried over, frozen) |
| `attn2` LoRA | cross-attention to **text** (Gemma) | prompt following | no (re-trained) |
| `ff` LoRA | feed-forward MLP | feature mixing / capacity | no (re-trained) |
| **`ref_attn` LoRA** | dedicated cross-attention to **32 visual memory tokens** | preserving fine-grained appearance of the reference | **yes** |
| **`ref_visual_proj`** | projects the ref VAE latent into 32 context tokens | the *content* that `ref_attn` attends to | **yes** |
| `ref_adaln_proj` | produces a global vector added to the timestep AdaLN | overall color/style/identity bias | retrained (new pooling) |
| `role_embedding` | adds a 128-dim bias to ref tokens in the IC-LoRA sequence | tells the transformer "this token is the reference" | frozen in the continuation |

So:
- `ref_adaln_proj-role_embedding` only had a **slow, global** appearance signal
  (AdaLN) plus the IC-LoRA-style ref tokens packed into the sequence.
- `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` adds a **fast,
  local** appearance signal (visual cross-attention) that injects the
  reference's actual textures into every block in the 12–35 range.

---

## 1. The 10 modules shared between both builds

These cover the full 48-block transformer and were retrained in
`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` (except `attn1.*`,
which is loaded but frozen; see the training freeze policy in the inventory).

### `attn1.{q,k,v,out.0}`: self-attention

Every transformer block first does self-attention over the latent video
tokens. The LoRA here adjusts how tokens relate to each other:
- **structural consistency** of the generated frames,
- **how strongly the `@reference` IC-LoRA token influences neighboring
  spatial positions**,
- low-level look (sharpness, contrast).

In the `..._ref_attn-ref_visual_proj` build these are frozen on purpose so
the original priors over motion and structure stay intact. If the inference
output looks structurally broken (jitter, motion drift, layout collapse),
you probably misloaded these adapters or the standard LoRA is at the wrong
strength.

### `attn2.{q,k,v,out.0}`: cross-attention to text

This is the prompt-following path. The Gemma text embedding is the K/V; the
video latent is the Q. The LoRA tunes how the prompt drives the edit.

- Stronger `attn2` deltas → the model **leans more on the prompt** ("Add
  @reference sleeping on the armrest"). Useful for compositional control.
- If you disable or weaken the standard LoRA (e.g. `strength_model=0`), the
  base model goes back to ignoring your edit instructions: even if `ref_attn`
  is still active, the prompt-binding is gone.

### `ff.net.{0.proj, 2}`: MLP capacity

The block's feed-forward part. The LoRA here adds **representational
capacity** to absorb the new behaviors that prompt + reference impose. There
is no single user-visible "knob" for this; it works behind the scenes.

If you slash its strength you'll see colors and textures drift back toward
generic LTX-2 outputs.

---

## 2. The new `ref_attn` branch

This is the heart of the change in
`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`. Each of the 48
transformer blocks now has an extra attention module, `ref_attn`, alongside
`attn1` (self) and `attn2` (text). `ref_attn` cross-attends from
the noisy video latent (Q) to **a small set of visual memory tokens
computed from the reference image** (K/V).

### Why four projections (q/k/v/out.0)

A standard cross-attention. The base weights are copied from `attn2` at load
time (`init_ref_attn_from: attn2`) so the module starts as "text cross-attn,
but pointed at visual tokens"; the LoRA then teaches it to actually
*use* those visual tokens.

### Per-block gating

`ref_attn` is only consulted in blocks **12–35** (this is what
`ref_start_block` / `ref_end_block` enforce at inference and what the trainer
used during fine-tuning). Skipping blocks 0–11 keeps the early low-level
features untouched; skipping blocks 36–47 lets the late decoding stages do
their job without extra visual bias.

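
The gating rule above can be sketched as a small predicate. This is a minimal illustration, not the node's or the trainer's actual code: the function and constant names are hypothetical, and `ref_end_block` is assumed to be inclusive because the text says blocks 36–47 are the ones skipped.

```python
# Hypothetical helper; only the 48-block count and the 12 / 35 defaults
# come from the text above.
NUM_BLOCKS = 48
REF_START_BLOCK = 12
REF_END_BLOCK = 35  # assumed inclusive ("blocks 12-35", 36-47 skipped)


def ref_attn_active(block_idx, start=REF_START_BLOCK, end=REF_END_BLOCK):
    """Return True if the ref_attn residual would be applied in this block."""
    if not 0 <= block_idx < NUM_BLOCKS:
        raise ValueError(f"block index {block_idx} outside 0..{NUM_BLOCKS - 1}")
    return start <= block_idx <= end


# Blocks in which the reference residual is added.
active_blocks = [i for i in range(NUM_BLOCKS) if ref_attn_active(i)]
print(len(active_blocks), active_blocks[0], active_blocks[-1])
```

Under these assumptions the residual is applied in 24 of the 48 blocks, which is why a small per-block scale can still add up to a visible effect.
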
### Impact

- **Strong identity preservation** for things the AdaLN anchor can't capture
  (small logos, eye color, fur texture, asymmetric details).
- Scaled by `ref_context_scale` (training default `0.01`). Small for a
  reason: the visual tokens are dense, and the residual is added on top of
  every block in the 12–35 range, so even at 0.01 the cumulative effect is
  meaningful.
- Doubling the scale (→ 0.02) usually intensifies identity at the cost of
  motion fidelity; going to 0.05+ tends to "freeze" parts of the scene to the
  reference appearance.
- Setting `ref_start_block=0` is **destructive**: blocks 0–11 never saw
  `ref_context` during training, so injecting it there feeds the model
  noise, and outputs collapse to black or random patterns.

---

## 3. The new `ref_visual_proj`

This is the *source* of what `ref_attn` attends to. Without it the
`ref_attn` LoRA is useless: there are no visual tokens to read.

### Forward

```
ref_frame = mean over time of the ref VAE latent     # [B, 128, H, W]
local     = adaptive_avg_pool to (4, 8)              # 32 spatial cells
global_mean, global_std over the whole frame         # 2 × 128
tokens    = concat(local, broadcast(mean,std))       # [B, 32, 384]
tokens    = proj(silu(fc1(tokens)))                  # [B, 32, 4096]
tokens    = LayerNorm(tokens)
tokens    = tokens + pos_embed[:, :32]
return tokens * token_scale                          # 0.25 in training
```

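
To make the dimensions in the pseudocode easier to follow, here is a shape-only walk-through. It creates no tensors and is not the projector itself: `trace_ref_visual_proj` is a hypothetical helper, and the `[B, 32, 128]` layout assumed for the pooled cells is inferred from the `[B, 32, 384]` comment rather than stated in the source.

```python
# Shape bookkeeping for the forward pass above; no tensors are created.
LATENT_CHANNELS = 128   # channel dim of the ref VAE latent
POOL_GRID = (4, 8)      # adaptive_avg_pool target -> 32 spatial cells
HIDDEN_DIM = 1024       # fc1 output width
CONTEXT_DIM = 4096      # proj output width (transformer context dim)


def trace_ref_visual_proj(batch, height, width):
    """Return (stage, shape) pairs for a latent of the given size."""
    num_tokens = POOL_GRID[0] * POOL_GRID[1]
    # local cell + broadcast global mean + broadcast global std
    descriptor_dim = 3 * LATENT_CHANNELS
    return [
        ("ref_frame", (batch, LATENT_CHANNELS, height, width)),
        ("local", (batch, num_tokens, LATENT_CHANNELS)),  # assumed layout
        ("tokens_in", (batch, num_tokens, descriptor_dim)),
        ("fc1", (batch, num_tokens, HIDDEN_DIM)),
        ("proj", (batch, num_tokens, CONTEXT_DIM)),
    ]


for stage, shape in trace_ref_visual_proj(batch=1, height=16, width=24):
    print(f"{stage:10s} {shape}")
```

The token count and descriptor width do not depend on the input height and width, since the adaptive pool always reduces to the same 4 × 8 grid.
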
### Layer-by-layer impact

| Tensor | What it controls | If perturbed |
|---|---|---|
| `fc1.weight / bias` (1024×384) | maps the 384-dim raw appearance descriptor into the projector's hidden space | weights here decide *which* aspects of the pooled appearance survive (e.g. color vs. texture vs. luminance) |
| `proj.weight / bias` (4096×1024) | lifts the hidden vector into the transformer context dim | initialized with small gain (0.05) so the branch starts almost-no-op; loaded from training |
| `norm.weight / bias` (4096) | LayerNorm on the projected tokens | keeps numerical range consistent across reference images so `ref_attn` works at the same scale regardless of input statistics |
| `pos_embed` (1, 32, 4096) | per-position bias for the 32 memory tokens | the model uses this to distinguish "top-left cell" from "bottom-right cell"; without it, all 32 tokens would be permutation-invariant and `ref_attn` would degenerate |

### `ref_token_scale` (training = 0.25)

This is the runtime multiplier on the output. It is **not** a stored tensor
but a knob in the inference node. Doubling it (→ 0.5) effectively doubles
the K/V magnitude that `ref_attn` reads, which biases attention scores
toward the reference tokens. Combined with `ref_context_scale`, you have
two independent ways to over-/under-amplify the visual reference branch.

---

## 4. `ref_adaln_proj`: *retrained, not continued*

Both builds have this projector, but **the input dimension changed**:

| | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` |
|---|---|---|
| Pooling | `avg_1x1 ⊕ max_1x1` (2-scale) | `avg_1x1 ⊕ avg_2x2 ⊕ max_1x1` (3-scale) |
| `fc1.weight` shape | (512, **256**) | (512, **768**) |

Because of the shape mismatch the trainer **reinitializes** `ref_adaln_proj`
from scratch when continuing from `ref_adaln_proj-role_embedding`. The
`ref_adaln_proj` in the continuation is not a fine-tune of the original; it
learned fresh. wandb confirms this: `ref_proj/weight_norm` ramps from
near-zero to ~2.9.

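
The two `fc1` input widths can be reproduced with a little arithmetic. The sketch below assumes each pool contributes `rows × cols × 128` features (128 being the latent channel count used elsewhere in this doc) and that the pools are concatenated; that reading is consistent with the 256 and 768 figures but is not confirmed by the trainer source. `pooled_dim` is a hypothetical helper.

```python
LATENT_CHANNELS = 128  # assumed: channel dim of the reference VAE latent


def pooled_dim(pool_spec, channels=LATENT_CHANNELS):
    """Feature width after concatenating pools named like 'avg_2x2'."""
    total = 0
    for name in pool_spec:
        _kind, grid = name.split("_")
        rows, cols = (int(n) for n in grid.split("x"))
        total += rows * cols * channels
    return total


original = pooled_dim(["avg_1x1", "max_1x1"])                 # 2-scale pool
continuation = pooled_dim(["avg_1x1", "avg_2x2", "max_1x1"])  # 3-scale pool
print(original, continuation)  # 256 768
```

The extra `avg_2x2` pool alone contributes 512 features, which is why the original `fc1.weight` cannot be reused.
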
### What it actually does

Builds one **per-sample** vector that is **added to the timestep bias** fed
into every transformer block's AdaLN layer. The result: a persistent,
sample-wide "lean toward this reference" applied throughout denoising.

### Why this is the *complement* of `ref_attn`

- `ref_attn` is **localized**: visual tokens cross-attend per spatial cell,
  letting the model copy fine details.
- `ref_adaln_proj` is **global**: a single conditioning vector tints all 48
  blocks uniformly. Best for "the overall look of the output should remind
  me of this reference" (palette, lighting, broad style).

### `adaln_scale` (training = 2.0)

The user-side multiplier. At training default 2.0, AdaLN is doing a lot of
the appearance lifting. Common failure modes:

- **`adaln_scale=0`**: model ignores the reference's global look; you keep
  only what `ref_attn` and the IC-LoRA tokens can recover. Expect washed-out
  identity.
- **`adaln_scale=1.0`** (ComfyUI default before the recent realignment):
  exactly half the training-time strength. Identity is still recognizable
  but visibly weaker.
- **`adaln_scale>3`**: identity dominates and the model starts ignoring the
  prompt / guide motion.

---

## 5. `role_embedding`: present in both, behavior depends on which you load

A learned `[1, 128]` vector that **adds a fingerprint** to the patchified
tokens belonging to the IC-LoRA reference image, so the transformer can tell
the ref token apart from generic guide / target tokens.

### In `ref_adaln_proj-role_embedding`
Was trained with `use_visual_ref_role_embedding=True`; that's where the
non-zero value (~0.125 norm) comes from. The `attn1`/`attn2` adapters in
this build therefore learned to *recognize* this bias.

### In `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`
Inherits the value from `ref_adaln_proj-role_embedding` but trains with
`use_visual_ref_role_embedding=False`, meaning the bias **is never added
during training**. The vector is frozen at its inherited value; wandb shows
its norm flat at 0.125 across the whole run.

### Inference rule

When loading `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`: keep
**`enable_role_embedding=False`**. Turning it on adds a bias to the ref
tokens that this build never saw: the trainable adapters (`attn2`, `ff`,
`ref_attn`) were trained without it, so the bias becomes adversarial noise
and degrades the output.

When loading `ref_adaln_proj-role_embedding` directly (no
`..._ref_attn-ref_visual_proj` adapters), the opposite is true:
`enable_role_embedding=True` matches the training distribution.

---

## 6. Quick reference: what each knob does at inference

| Knob | `..._ref_attn-ref_visual_proj` training value | Effect of raising it | Effect of lowering it |
|---|---|---|---|
| `adaln_scale` | 2.0 | stronger global look | identity fades |
| `ref_context_scale` | 0.01 | sharper fine-grained ID; can over-freeze | local detail blurs back to base |
| `ref_token_scale` | 0.25 | more "voice" for the visual tokens in attention | `ref_attn` becomes a no-op |
| `ref_start_block` / `ref_end_block` | 12 / 35 | (do not change) | (do not change); outside this range the LoRA is untrained |
| `enable_role_embedding` | False | adds out-of-distribution bias to ref tokens | matches training |
| `role_strength` | n/a | only matters if `enable_role_embedding=True` | |
| Standard LoRA `strength_model` | 1.0 | over-fits to training distribution | drifts back toward base LTX-2 |

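
One way to keep a workflow honest about these values is to compare its settings against the training defaults in the table. The sketch below is illustrative only: `TRAINING_KNOBS`, `HARD_KNOBS`, and `check_knobs` are hypothetical names and not part of any ComfyUI node, and the split into "unsupported" and "off-distribution" is my reading of the table, not a rule from the source.

```python
# Training-time values copied from the table above.
TRAINING_KNOBS = {
    "adaln_scale": 2.0,
    "ref_context_scale": 0.01,
    "ref_token_scale": 0.25,
    "ref_start_block": 12,
    "ref_end_block": 35,
    "enable_role_embedding": False,
    "strength_model": 1.0,
}

# Knobs the table marks as "do not change" or as out-of-distribution.
HARD_KNOBS = ("ref_start_block", "ref_end_block", "enable_role_embedding")


def check_knobs(settings):
    """Return one warning string per knob that differs from training."""
    warnings = []
    for name, trained in TRAINING_KNOBS.items():
        value = settings.get(name, trained)  # missing knobs count as default
        if value == trained:
            continue
        level = "unsupported" if name in HARD_KNOBS else "off-distribution"
        warnings.append(f"{name}={value!r} ({level}; trained with {trained!r})")
    return warnings


for line in check_knobs({"adaln_scale": 1.0, "ref_start_block": 0}):
    print(line)
```
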
The combination that mirrors training of the
`..._ref_attn-ref_visual_proj` build exactly: `adaln_scale=2.0,
ref_context_scale=0.01, ref_token_scale=0.25, ref_start_block=12,
ref_end_block=35, enable_role_embedding=False, ref_init_from=attn2,
strength_model=1.0`.

---

## 7. Where the loaded files come from

`scripts/split_editanything_lora.py` produces two safetensors per checkpoint.
The filename suffix lists every extra that ended up in the module sidecar
(fixed order: `ref_adaln_proj`, `role_embedding`, `ref_attn`,
`ref_visual_proj`), so you can tell which mechanisms each pair carries
without opening the file.

Canonical pairs:

```
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors

edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors
```

Feed the `.standard.*` into ComfyUI's standard LoRA loader and the
`.module.*` into `LTXVEditAnythingModuleLoader`. Mixing pairs across builds
(e.g., `ref_adaln_proj-role_embedding.standard.*` with
`..._ref_attn-ref_visual_proj.module.*`) is not supported: the LoRA deltas
were trained against the partner adapters in the same build.
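
Because the extras suffix is encoded in the filename, a mismatched pair can be caught before anything is loaded. The following is a minimal sketch under the naming scheme described above; `parse_build` and `same_build` are hypothetical helpers, not functions from the repository or from ComfyUI.

```python
# The basename and the list of extras come from the text above.
BASENAME = "edit_anything_reference_v0.1_r128"
KNOWN_EXTRAS = ("ref_adaln_proj", "role_embedding", "ref_attn", "ref_visual_proj")


def parse_build(filename):
    """Split a canonical filename into (extras, kind)."""
    # Only the last two dots are separators; "v0.1" stays inside the stem.
    stem, kind, ext = filename.rsplit(".", 2)
    if ext != "safetensors" or kind not in ("standard", "module"):
        raise ValueError(f"not a canonical EditAnything file: {filename}")
    prefix = BASENAME + "_"
    if not stem.startswith(prefix):
        raise ValueError(f"unexpected basename in: {filename}")
    extras = tuple(stem[len(prefix):].split("-"))
    unknown = [e for e in extras if e not in KNOWN_EXTRAS]
    if unknown:
        raise ValueError(f"unknown extras {unknown} in: {filename}")
    return extras, kind


def same_build(standard_file, module_file):
    """True if the two files form a (.standard, .module) pair of one build."""
    std_extras, std_kind = parse_build(standard_file)
    mod_extras, mod_kind = parse_build(module_file)
    return (std_kind, mod_kind) == ("standard", "module") and std_extras == mod_extras


old = BASENAME + "_ref_adaln_proj-role_embedding"
new = old + "-ref_attn-ref_visual_proj"
print(same_build(new + ".standard.safetensors", new + ".module.safetensors"))
print(same_build(old + ".standard.safetensors", new + ".module.safetensors"))
```
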
lora_layers_reference.md
ADDED
@@ -0,0 +1,196 @@
# LoRA Layer Inventory: Edit Anything checkpoints

Inventory of every tensor in two builds of the
`edit_anything_reference_v0.1_r128` LoRA.

Both builds share the same canonical basename
(`edit_anything_reference_v0.1_r128`) and are distinguished by the **extras
suffix** that `scripts/split_editanything_lora.py` appends to the output
filenames:

- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.{standard,module}.safetensors`:
  the original build. Only ships `ref_adaln_proj` + `role_embedding`.
- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.{standard,module}.safetensors`:
  the continuation, fine-tuned with the
  `video_to_video_ref_visual_adaln` strategy. Adds the `ref_attn` LoRA
  branch and the `ref_visual_proj` projector on top of the original
  extras.

In the rest of this doc the two are referred to by their suffix only:
- `ref_adaln_proj-role_embedding`
- `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`

Rank is 128 in both (encoded in the LoRA tensor shapes; no `alpha` keys saved).
Dtype is `bfloat16` throughout. All LoRA modules cover **48 transformer blocks**.

---

## 1. Summary

| | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` |
|---|---|---|
| Total tensors | 965 | 1356 |
| LoRA-target modules | **10** | **14** |
| LoRA tensors (A+B) | 960 | 1344 |
| Extra (non-LoRA) tensors | 5 | 12 |
| `ref_attn` LoRA branch | ❌ absent | ✅ trained on 48 blocks |
| `ref_visual_proj` (visual cross-attn projector) | ❌ absent | ✅ present (7 tensors) |
| `ref_adaln_proj` (global appearance AdaLN) | ✅ (fc1 input dim **256**) | ✅ (fc1 input dim **768**) |
| `role_embedding` | ✅ shape (1, 128) | ✅ shape (1, 128) |

---

## 2. LoRA adapters

Each row = one target module type. Each entry = (`lora_A.weight`, `lora_B.weight`)
duplicated across the 48 blocks of `diffusion_model.transformer_blocks.*`.

| Module | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` | Notes |
|---|:---:|:---:|---|
| `attn1.to_q` | ✅ | ✅ | self-attention query |
| `attn1.to_k` | ✅ | ✅ | self-attention key |
| `attn1.to_v` | ✅ | ✅ | self-attention value |
| `attn1.to_out.0` | ✅ | ✅ | self-attention output proj |
| `attn2.to_q` | ✅ | ✅ | cross-attention to text (Gemma) |
| `attn2.to_k` | ✅ | ✅ | |
| `attn2.to_v` | ✅ | ✅ | |
| `attn2.to_out.0` | ✅ | ✅ | |
| `ff.net.0.proj` | ✅ | ✅ | feed-forward up-projection |
| `ff.net.2` | ✅ | ✅ | feed-forward down-projection |
| `ref_attn.to_q` | ❌ | ✅ | **new**: visual reference cross-attention |
| `ref_attn.to_k` | ❌ | ✅ | **new** |
| `ref_attn.to_v` | ❌ | ✅ | **new** |
| `ref_attn.to_out.0` | ❌ | ✅ | **new** |

**Key naming**: `diffusion_model.transformer_blocks.{0..47}.{module}.{lora_A|lora_B}.weight`

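
The key-naming pattern above can be expanded mechanically, which is handy when checking a state dict for missing adapters. This is a small sketch that only builds the expected key strings from the table and the pattern; `lora_keys` is a hypothetical helper and nothing here reads a real checkpoint.

```python
# Module names copied from the table above.
SHARED_MODULES = [
    "attn1.to_q", "attn1.to_k", "attn1.to_v", "attn1.to_out.0",
    "attn2.to_q", "attn2.to_k", "attn2.to_v", "attn2.to_out.0",
    "ff.net.0.proj", "ff.net.2",
]
REF_ATTN_MODULES = [
    "ref_attn.to_q", "ref_attn.to_k", "ref_attn.to_v", "ref_attn.to_out.0",
]
NUM_BLOCKS = 48


def lora_keys(modules, num_blocks=NUM_BLOCKS):
    """Expand the naming pattern into the full list of expected LoRA keys."""
    return [
        f"diffusion_model.transformer_blocks.{block}.{module}.{part}.weight"
        for block in range(num_blocks)
        for module in modules
        for part in ("lora_A", "lora_B")
    ]


original_keys = lora_keys(SHARED_MODULES)
continuation_keys = lora_keys(SHARED_MODULES + REF_ATTN_MODULES)
print(len(original_keys), len(continuation_keys))  # 960 1344
print(original_keys[0])
```
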
**Training freeze policy** for the
`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build
(per `stage2_ref_visual_adaln_crossattn_from_v01_r128.yaml`):
- `attn1.*` adapters loaded from the `ref_adaln_proj-role_embedding` build
  but **frozen** (`trainable_include_patterns` excludes them).
- `attn2.*`, `ff.*`, `ref_attn.*` are trainable.

---

## 3. Non-LoRA modules (the module sidecar)

These tensors live at the top of the state dict (no `transformer_blocks.*` prefix)
and are consumed by the custom inference path (`LTXVEditAnythingModuleLoader` +
`LTXVEditAnythingLoopingSampler`), not by the standard ComfyUI LoRA loader.

### 3.1. `role_embedding`: appearance role bias

| Key | Shape | Notes |
|---|---|---|
| `role_embedding.embedding.weight` | (1, 128) | 1 slot (appearance). Padded to (3, 128) at inference; entry stored at slot 1 (ref_img role). |

Present in **both** builds with the same shape. In the
`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build it is
**frozen** (`use_visual_ref_role_embedding: false`); wandb shows its norm
stays flat at ~0.125 throughout training.

### 3.2. `ref_adaln_proj`: global AdaLN appearance anchor

Two-layer MLP that pools the reference latent into a vector added to every
block's AdaLN timestep bias.

| Key | `ref_adaln_proj-role_embedding` shape | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` shape |
|---|---|---|
| `ref_adaln_proj.fc1.weight` | (512, **256**) | (512, **768**) |
| `ref_adaln_proj.fc1.bias` | (512,) | (512,) |
| `ref_adaln_proj.proj.weight` | (36864, 512) | (36864, 512) |
| `ref_adaln_proj.proj.bias` | (36864,) | (36864,) |

> ⚠️ **Shape mismatch on `fc1.weight`**.
> The `ref_adaln_proj-role_embedding` build was trained with a 2-scale pool
> (`avg_1x1 ⊕ max_1x1` → 256-dim input).
> The `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` build was
> trained with a 3-scale pool (`avg_1x1 ⊕ avg_2x2 ⊕ max_1x1` → 768-dim).
> Because of this incompatibility the trainer **reinitializes**
> `ref_adaln_proj` from scratch when continuing from
> `ref_adaln_proj-role_embedding`; the AdaLN projector in the continuation
> is **not** a fine-tune of the original one. The output dim 36864 = AdaLN
> param count for the LTX-2 transformer (read at runtime via
> `preprocessor.adaln.linear.out_features`).

### 3.3. `ref_visual_proj`: visual cross-attention memory tokens

Present in `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` only.
`SafeVisualRefProjector` (training file `video_to_video_ref_visual.py`).
Produces 32 visual memory tokens consumed by the new `ref_attn` branch.

| Key | Shape | Notes |
|---|---|---|
| `ref_visual_proj.fc1.weight` | (1024, **384**) | input 384 = 128 (local pooled) + 128 (global mean) + 128 (global std) |
| `ref_visual_proj.fc1.bias` | (1024,) | xavier init gain 0.1 |
| `ref_visual_proj.proj.weight` | (4096, 1024) | maps to context_dim 4096; xavier init gain 0.05 |
| `ref_visual_proj.proj.bias` | (4096,) | |
| `ref_visual_proj.norm.weight` | (4096,) | LayerNorm γ |
| `ref_visual_proj.norm.bias` | (4096,) | LayerNorm β |
| `ref_visual_proj.pos_embed` | (1, 32, 4096) | per-token learned positional bias |

Forward (matches `SafeVisualRefProjector.forward`):
```
tokens = local ⊕ global_mean ⊕ global_std        # [B, 32, 384]
tokens = proj(silu(fc1(tokens)))                  # → [B, 32, 4096]
tokens = LayerNorm(tokens)
tokens = tokens + pos_embed[:, :32]
return tokens * token_scale                       # training default 0.25
```

Not present in `ref_adaln_proj-role_embedding`; this entire branch is new.

---

## 4. Total tensor counts (sanity check)

### `ref_adaln_proj-role_embedding`
```
LoRA:           10 modules × 48 blocks × 2 (A,B) = 960
ref_adaln_proj: 4 (fc1.{w,b}, proj.{w,b})        =   4
role_embedding: 1                                =   1
                                           total = 965 ✓
```

### `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`
```
LoRA:            14 modules × 48 blocks × 2 (A,B) = 1344
ref_adaln_proj:  4                                =    4
ref_visual_proj: 7                                =    7
role_embedding:  1                                =    1
                                            total = 1356 ✓
```
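
The same sanity check can be run as code, for example against the tensor count reported by a loaded state dict. This is a minimal sketch of the arithmetic only; `total_tensors` is a hypothetical helper and the per-module counts are copied from the tables in this doc.

```python
NUM_BLOCKS = 48
LORA_PARTS = 2  # lora_A and lora_B per target module


def total_tensors(lora_modules, extras):
    """LoRA tensors across all blocks plus the non-LoRA sidecar tensors."""
    return lora_modules * NUM_BLOCKS * LORA_PARTS + sum(extras.values())


original = total_tensors(10, {"ref_adaln_proj": 4, "role_embedding": 1})
continuation = total_tensors(
    14, {"ref_adaln_proj": 4, "ref_visual_proj": 7, "role_embedding": 1}
)
print(original, continuation)  # 965 1356
```
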

---

## 5. Loading checkpoint at inference

Use `scripts/split_editanything_lora.py` to split each raw training
checkpoint into:
- `*.standard.safetensors`: LoRA on `attn1/attn2/ff` only; safe to feed to
  ComfyUI's standard LoraLoader.
- `*.module.safetensors`: everything else (`role_embedding`,
  `ref_adaln_proj`, `ref_visual_proj`, `ref_attn` LoRA adapters); feed to
  `LTXVEditAnythingModuleLoader`.

The filename suffix lists every extra that ended up in the module sidecar,
so it is obvious at a glance which mechanisms a given pair carries. Order is
fixed: `ref_adaln_proj`, `role_embedding`, `ref_attn`, `ref_visual_proj`.

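
The routing rule described above (LoRA on `attn1`/`attn2`/`ff` goes to the standard file, everything else to the module sidecar) can be illustrated on key names alone. This sketch does not reproduce `split_editanything_lora.py`, whose source is not shown here; `route_key` and `split_keys` are hypothetical helpers that only apply the rule as stated in this doc.

```python
BLOCK_PREFIX = "diffusion_model.transformer_blocks."
STANDARD_MODULE_PREFIXES = ("attn1.", "attn2.", "ff.")


def route_key(key):
    """Return 'standard' or 'module' for one state-dict key."""
    if key.startswith(BLOCK_PREFIX):
        # Drop the prefix and the block index, keeping e.g.
        # "attn1.to_q.lora_A.weight".
        _block, _, rest = key[len(BLOCK_PREFIX):].partition(".")
        if rest.startswith(STANDARD_MODULE_PREFIXES):
            return "standard"
    return "module"


def split_keys(keys):
    """Group keys by the file they would be written to under this rule."""
    buckets = {"standard": [], "module": []}
    for key in keys:
        buckets[route_key(key)].append(key)
    return buckets


example = split_keys([
    "diffusion_model.transformer_blocks.3.attn1.to_q.lora_A.weight",
    "diffusion_model.transformer_blocks.12.ref_attn.to_k.lora_B.weight",
    "ref_visual_proj.pos_embed",
    "role_embedding.embedding.weight",
])
print(len(example["standard"]), len(example["module"]))
```

Routing by module name rather than by block index keeps the `ref_attn` adapters together with the projector they depend on, which matches the pairing rule described in this doc.
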
### Canonical output names

```
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors

edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors
edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors
```

### Command

```bash
python3 /data/training/ltx-edit-trainer/scripts/split_editanything_lora.py \
    <raw-checkpoint>.safetensors --output-dir <dir> [--overwrite]
```