| # Functional differences between the two builds and what each layer does | |
| Companion to `lora_layers_reference.md`. That file is the inventory; this one | |
| explains the **functional role** of every group of tensors and the **expected | |
| behavioral impact** of toggling each branch at inference. | |
| Two builds of the `edit_anything_reference_v0.1_r128` LoRA exist, each | |
| delivered as a `(.standard, .module)` pair. The pairs are distinguished by | |
| their **extras suffix**: | |
| - `..._ref_adaln_proj-role_embedding.{standard,module}.safetensors`: the | |
| original build. One mechanism for steering the model toward the reference | |
| image: **global AdaLN appearance anchoring** (plus the IC-LoRA-style ref | |
| tokens packed into the sequence). | |
| - `..._ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.{standard,module}.safetensors`: | |
| the continuation. Keeps everything from the original build and adds | |
| **two new mechanisms** that operate on different time/space scales. | |
| In the rest of this doc the two builds are referred to by their suffix only: | |
| - `ref_adaln_proj-role_embedding` | |
| - `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` | |
| --- | |
| ## TL;DR | |
| | Branch | Where it acts | What it controls | New in `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`? | | |
| |---|---|---|---| | |
| | `attn1` LoRA | self-attention inside every block | scene cohesion, structural editing | no (carried over, frozen) | | |
| | `attn2` LoRA | cross-attention to **text** (Gemma) | prompt following | no (re-trained) | | |
| | `ff` LoRA | feed-forward MLP | feature mixing / capacity | no (re-trained) | | |
| | **`ref_attn` LoRA** | dedicated cross-attention to **32 visual memory tokens** | preserving fine-grained appearance of the reference | **yes** | | |
| | **`ref_visual_proj`** | projects the ref VAE latent into 32 context tokens | the *content* that `ref_attn` attends to | **yes** | | |
| | `ref_adaln_proj` | produces a global vector added to the timestep AdaLN | overall color/style/identity bias | retrained (new pooling) | | |
| | `role_embedding` | adds a 128-dim bias to ref tokens in the IC-LoRA sequence | tells the transformer "this token is the reference" | frozen in the continuation | | |
| So: | |
| - `ref_adaln_proj-role_embedding` only had a **slow, global** appearance signal | |
| (AdaLN) plus the IC-LoRA-style ref tokens packed into the sequence. | |
| - `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` adds a **fast, | |
| local** appearance signal (visual cross-attention) that injects the | |
| reference's actual textures into every block in the 12 to 35 range. | |
| --- | |
| ## 1. The 10 modules shared between both builds | |
| These cover the full 48-block transformer and were retrained in | |
| `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` (except `attn1.*`, | |
| which is loaded but frozen; see the training freeze policy in the inventory). | |
| ### `attn1.{q,k,v,out.0}`: self-attention | |
| Every transformer block first does self-attention over the latent video | |
| tokens. The LoRA here adjusts how tokens relate to each other: | |
| - **structural consistency** of the generated frames, | |
| - **how strongly the `@reference` IC-LoRA token influences neighboring | |
| spatial positions**, | |
| - low-level look (sharpness, contrast). | |
| In the `..._ref_attn-ref_visual_proj` build these are frozen on purpose so | |
| the original priors over motion and structure stay intact. If the inference | |
| output looks structurally broken (jitter, motion drift, layout collapse), | |
| you probably misloaded these adapters or the standard LoRA is at the wrong | |
| strength. | |
| ### `attn2.{q,k,v,out.0}`: cross-attention to text | |
| This is the prompt-following path. The Gemma text embedding is the K/V; the | |
| video latent is the Q. The LoRA tunes how the prompt drives the edit. | |
| - With stronger `attn2` deltas, the model **leans more on the prompt** ("Add | |
| @reference sleeping on the armrest"). Useful for compositional control. | |
| - If you disable or weaken the standard LoRA (e.g. `strength_model=0`), the | |
| base model goes back to ignoring your edit instructions. Even if `ref_attn` | |
| is still active, the prompt-binding is gone. | |
| ### `ff.net.{0.proj, 2}`: MLP capacity | |
| The block's feed-forward part. The LoRA here adds **representational | |
| capacity** to absorb the new behaviors that prompt + reference impose. There | |
| is no single user-visible "knob" for this; it works behind the scenes. | |
| If you slash its strength you'll see colors and textures drift back toward | |
| generic LTX-2 outputs. | |
| --- | |
| ## 2. The new `ref_attn` branch | |
| This is the heart of the change in | |
| `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`. Each of the 48 | |
| transformer blocks now has an additional attention module, `ref_attn`, alongside | |
| `attn1` (self) and `attn2` (text). `ref_attn` cross-attends from | |
| the noisy video latent (Q) to **a small set of visual memory tokens | |
| computed from the reference image** (K/V). | |
| ### Why four projections (q/k/v/out.0) | |
| `ref_attn` is a standard cross-attention, so it carries the usual query, key, | |
| value and output projections. The base weights are copied from `attn2` at load | |
| time (`init_ref_attn_from: attn2`) so the module starts as "text cross-attn, | |
| but pointed at visual tokens"; the LoRA then teaches it to actually | |
| *use* those visual tokens. | |
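The copy-from-`attn2` initialization can be pictured as a key rename over a state dict. This is only a sketch: the key names (`blocks.12.attn2.q.weight` and so on) are hypothetical, plain lists stand in for tensors, and the real loader may handle this differently.

```python
def init_ref_attn_from_attn2(state_dict):
    """Return a copy of state_dict with every attn2 entry duplicated under a ref_attn key."""
    out = dict(state_dict)
    for key, tensor in state_dict.items():
        if ".attn2." in key:
            # list(...) makes a copy so later updates to ref_attn do not alias attn2
            out[key.replace(".attn2.", ".ref_attn.")] = list(tensor)
    return out


# Hypothetical key names; plain lists stand in for real tensors.
base = {
    "blocks.12.attn1.q.weight": [0.1, 0.2],
    "blocks.12.attn2.q.weight": [0.3, 0.4],
    "blocks.12.attn2.out.0.weight": [0.5, 0.6],
}
initialized = init_ref_attn_from_attn2(base)
print(sorted(k for k in initialized if ".ref_attn." in k))
```

After this step `ref_attn` behaves exactly like text cross-attention until the LoRA deltas are applied on top.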
| ### Per-block gating | |
| `ref_attn` is only consulted in blocks **12 to 35** (this is what | |
| `ref_start_block` / `ref_end_block` enforce at inference and what the trainer | |
| used during fine-tuning). Skipping blocks 0-11 keeps the early low-level | |
| features untouched; skipping blocks 36-47 lets the late decoding stages do | |
| their job without extra visual bias. | |
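As a small illustration of the gating rule, the check below treats the range as inclusive on both ends, which is an assumption inferred from the "36-47 are skipped" wording. The function name is made up for this sketch.

```python
NUM_BLOCKS = 48  # total transformer blocks, per this document


def ref_attn_active(block_idx, ref_start_block=12, ref_end_block=35):
    """True when ref_attn should run in this block (inclusive range, assumed)."""
    return ref_start_block <= block_idx <= ref_end_block


active = [i for i in range(NUM_BLOCKS) if ref_attn_active(i)]
print(len(active), active[0], active[-1])  # 24 12 35
```

So half of the blocks receive the visual residual, which is part of why a small `ref_context_scale` still adds up.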
| ### Impact | |
| - **Strong identity preservation** for things the AdaLN anchor can't capture | |
| (small logos, eye color, fur texture, asymmetric details). | |
| - Scaled by `ref_context_scale` (training default `0.01`). The value is small for a | |
| reason: the visual tokens are dense, and the residual is added on top of | |
| every block in the 12-35 range, so even at 0.01 the cumulative effect is | |
| meaningful. | |
| - Doubling the scale (to 0.02) usually intensifies identity at the cost of | |
| motion fidelity; going to 0.05+ tends to "freeze" parts of the scene to the | |
| reference appearance. | |
| - Setting `ref_start_block=0` is **destructive**: blocks 0-11 never saw | |
| `ref_context` during training, so injecting it there feeds the model | |
| noise, and outputs collapse to black or random patterns. | |
| --- | |
| ## 3. The new `ref_visual_proj` | |
| This is the *source* of what `ref_attn` attends to. Without it the | |
| `ref_attn` LoRA is useless: there are no visual tokens to read. | |
| ### Forward | |
| ``` | |
| ref_frame = mean over time of the ref VAE latent # [B, 128, H, W] | |
| local = adaptive_avg_pool to (4, 8) # 32 spatial cells | |
| global_mean, global_std over the whole frame # 2 × 128 | |
| tokens = concat(local, broadcast(mean,std)) # [B, 32, 384] | |
| tokens = proj(silu(fc1(tokens))) # [B, 32, 4096] | |
| tokens = LayerNorm(tokens) | |
| tokens = tokens + pos_embed[:, :32] | |
| return tokens * token_scale # 0.25 in training | |
| ``` | |
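The first three lines of the listing (pooling to a 4×8 grid and appending global statistics) can be sketched in plain Python. This is an illustration under assumptions, not the trainer's code: the bin edges follow the usual adaptive-average-pooling convention, the standard deviation is the population form (the source does not say which form is used), and a toy channel count of 2 replaces 128. The `fc1`/`proj`/LayerNorm/`pos_embed` steps are left out.

```python
import math


def adaptive_bins(size, out):
    """Bin edges [floor(i*size/out), ceil((i+1)*size/out)) for i in range(out)."""
    return [(i * size // out, -(-(i + 1) * size // out)) for i in range(out)]


def raw_descriptor(frame, out_h=4, out_w=8):
    """frame: [C][H][W] nested lists. Returns out_h*out_w tokens of 3*C values each."""
    H, W = len(frame[0]), len(frame[0][0])
    means, stds = [], []
    for ch in frame:  # per-channel global statistics
        vals = [v for row in ch for v in row]
        m = sum(vals) / len(vals)
        means.append(m)
        stds.append(math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals)))
    tokens = []
    for h0, h1 in adaptive_bins(H, out_h):
        for w0, w1 in adaptive_bins(W, out_w):
            area = (h1 - h0) * (w1 - w0)
            local = [
                sum(ch[y][x] for y in range(h0, h1) for x in range(w0, w1)) / area
                for ch in frame
            ]
            # local cell average, then the same global mean/std broadcast to every token
            tokens.append(local + means + stds)
    return tokens


C, H, W = 2, 8, 16  # toy sizes; the document's latent has 128 channels
frame = [[[c + 0.1 * y + 0.01 * x for x in range(W)] for y in range(H)] for c in range(C)]
tokens = raw_descriptor(frame)
print(len(tokens), len(tokens[0]))  # 32 6
```

With 128 channels the per-token width becomes 3 × 128 = 384, matching the `[B, 32, 384]` shape in the listing.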
| ### Layer-by-layer impact | |
| | Tensor | What it controls | If perturbed | | |
| |---|---|---| | |
| | `fc1.weight / bias` (1024×384) | maps the 384-dim raw appearance descriptor into the projector's hidden space | weights here decide *which* aspects of the pooled appearance survive (e.g. color vs. texture vs. luminance) | | |
| | `proj.weight / bias` (4096×1024) | lifts the hidden vector into the transformer context dim | initialized with small gain (0.05) so the branch starts almost-no-op; loaded from training | | |
| | `norm.weight / bias` (4096) | LayerNorm on the projected tokens | keeps numerical range consistent across reference images so `ref_attn` works at the same scale regardless of input statistics | | |
| | `pos_embed` (1, 32, 4096) | per-position bias for the 32 memory tokens | the model uses this to distinguish "top-left cell" from "bottom-right cell"; without it, all 32 tokens would be permutation-invariant and `ref_attn` would degenerate | | |
| ### `ref_token_scale` (training = 0.25) | |
| This is the runtime multiplier on the output. It is **not** a stored tensor | |
| but a knob in the inference node. Doubling it (to 0.5) effectively doubles | |
| the K/V magnitude that `ref_attn` reads, which biases attention scores | |
| toward the reference tokens. Combined with `ref_context_scale`, you have | |
| two independent ways to over-/under-amplify the visual reference branch. | |
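A toy single-head attention shows why the two knobs are not interchangeable. The sketch assumes identity K/V projections and uses the tokens as both keys and values, which the real module does not do; the function names are made up for the illustration.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def ref_attn_residual(q, tokens, token_scale, context_scale):
    """Toy cross-attention: returns (attention weights, scaled residual)."""
    kv = [[token_scale * t for t in tok] for tok in tokens]  # scale hits K and V alike
    d = len(q)
    logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in kv]
    weights = softmax(logits)
    out = [sum(w * v[j] for w, v in zip(weights, kv)) for j in range(d)]
    return weights, [context_scale * o for o in out]


q = [1.0, 0.0]
tokens = [[2.0, 0.0], [0.0, 2.0]]
w_lo, r_lo = ref_attn_residual(q, tokens, token_scale=0.25, context_scale=0.01)
w_hi, r_hi = ref_attn_residual(q, tokens, token_scale=0.5, context_scale=0.01)
w_ctx, r_ctx = ref_attn_residual(q, tokens, token_scale=0.25, context_scale=0.02)
print(w_lo[0] < w_hi[0], w_ctx == w_lo)  # True True
```

In this toy setup, raising the token scale sharpens the attention weights as well as enlarging the values, while raising the context scale only rescales the residual that is added back.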
| --- | |
| ## 4. `ref_adaln_proj`: *retrained, not continued* | |
| Both builds have this projector, but **the input dimension changed**: | |
| | | `ref_adaln_proj-role_embedding` | `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` | | |
| |---|---|---| | |
| | Pooling | `avg_1x1 + max_1x1` (2-scale) | `avg_1x1 + avg_2x2 + max_1x1` (3-scale) | | |
| | `fc1.weight` shape | (512, **256**) | (512, **768**) | | |
| Because of the shape mismatch the trainer **reinitializes** `ref_adaln_proj` | |
| from scratch when continuing from `ref_adaln_proj-role_embedding`. The | |
| `ref_adaln_proj` in the continuation is not a fine-tune of the original; it | |
| was learned fresh. The wandb logs confirm this: `ref_proj/weight_norm` ramps from | |
| near-zero to ~2.9. | |
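The two `fc1` input widths follow from the pooling specs if each pooled grid is flattened and concatenated. The sketch below assumes that reading, plus the 128 latent channels implied by the `[B, 128, H, W]` shape in the `ref_visual_proj` forward listing; the helper name is hypothetical.

```python
LATENT_CHANNELS = 128  # assumed from the [B, 128, H, W] ref latent shape


def pooled_dim(pool_spec, channels=LATENT_CHANNELS):
    """Each 'kind_HxW' entry contributes channels * H * W features once flattened."""
    total = 0
    for entry in pool_spec:
        _, grid = entry.split("_")
        h, w = (int(n) for n in grid.split("x"))
        total += channels * h * w
    return total


print(pooled_dim(["avg_1x1", "max_1x1"]))             # 256
print(pooled_dim(["avg_1x1", "avg_2x2", "max_1x1"]))  # 768
```

The added `avg_2x2` scale alone accounts for the extra 512 inputs, so the old `fc1.weight` cannot be reused.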
| ### What it actually does | |
| Builds one **per-sample** vector that is **added to the timestep bias** fed | |
| into every transformer block's AdaLN layer. The result: a persistent, | |
| sample-wide "lean toward this reference" applied throughout denoising. | |
| ### Why this is the *complement* of `ref_attn` | |
| - `ref_attn` is **localized**: visual tokens cross-attend per spatial cell, | |
| letting the model copy fine details. | |
| - `ref_adaln_proj` is **global**: a single conditioning vector tints all 48 | |
| blocks uniformly. Best for "the overall look of the output should remind | |
| me of this reference" (palette, lighting, broad style). | |
| ### `adaln_scale` (training = 2.0) | |
| The user-side multiplier. At training default 2.0, AdaLN is doing a lot of | |
| the appearance lifting. Common failure modes: | |
| - **`adaln_scale=0`**: model ignores the reference's global look; you keep | |
| only what `ref_attn` and the IC-LoRA tokens can recover. Expect washed-out | |
| identity. | |
| - **`adaln_scale=1.0`** (ComfyUI default before the recent realignment): | |
| exactly half the training-time strength. Identity is still recognizable | |
| but visibly weaker. | |
| - **`adaln_scale>3`**: identity dominates and the model starts ignoring the | |
| prompt / guide motion. | |
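One way to read the `adaln_scale` behavior is as a multiplier on the reference vector before it is added to the timestep conditioning. That placement of the multiplier is an assumption made for this sketch, and the function name is illustrative only.

```python
def adaln_conditioning(timestep_emb, ref_vec, adaln_scale=2.0):
    """Timestep conditioning plus a scaled, per-sample reference vector (assumed form)."""
    return [t + adaln_scale * r for t, r in zip(timestep_emb, ref_vec)]


t_emb = [0.5, -1.0]
ref_vec = [0.25, 0.5]
print(adaln_conditioning(t_emb, ref_vec, adaln_scale=0.0))  # [0.5, -1.0]
print(adaln_conditioning(t_emb, ref_vec, adaln_scale=1.0))  # [0.75, -0.5]
print(adaln_conditioning(t_emb, ref_vec, adaln_scale=2.0))  # [1.0, 0.0]
```

Under this reading, `adaln_scale=1.0` shifts the conditioning by exactly half as much as the training value, and `0` removes the global reference signal entirely.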
| --- | |
| ## 5. `role_embedding`: present in both, behavior depends on which you load | |
| A learned `[1, 128]` vector that **adds a fingerprint** to the patchified | |
| tokens belonging to the IC-LoRA reference image, so the transformer can tell | |
| the ref token apart from generic guide / target tokens. | |
| ### In `ref_adaln_proj-role_embedding` | |
| Was trained with `use_visual_ref_role_embedding=True`, which is where the | |
| non-zero value (~0.125 norm) comes from. The `attn1`/`attn2` adapters in | |
| this build therefore learned to *recognize* this bias. | |
| ### In `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` | |
| Inherits the value from `ref_adaln_proj-role_embedding` but trains with | |
| `use_visual_ref_role_embedding=False`, meaning the bias **is never added | |
| during training**. The vector is frozen at its inherited value; wandb shows | |
| its norm flat at 0.125 across the whole run. | |
| ### Inference rule | |
| When loading `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`: keep | |
| **`enable_role_embedding=False`**. Turning it on adds a bias to the ref | |
| tokens that this build never saw. The adapters that were retrained for this | |
| build (`attn2`, `ff`) learned without it, so the bias acts as adversarial noise and degrades the output. | |
| When loading `ref_adaln_proj-role_embedding` directly (no | |
| `..._ref_attn-ref_visual_proj` adapters), the opposite is true: | |
| `enable_role_embedding=True` matches the training distribution. | |
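The rule can be encoded as a lookup on the build suffix. The helper below is hypothetical and only restates the two cases described here.

```python
def should_enable_role_embedding(build_suffix):
    """True only for the original build, which was trained with the role bias applied."""
    extras = build_suffix.split("-")
    return "role_embedding" in extras and "ref_attn" not in extras


print(should_enable_role_embedding("ref_adaln_proj-role_embedding"))  # True
print(should_enable_role_embedding(
    "ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj"))        # False
```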
| --- | |
| ## 6. Quick reference: what each knob does at inference | |
| | Knob | `..._ref_attn-ref_visual_proj` training value | Effect of raising it | Effect of lowering it | | |
| |---|---|---|---| | |
| | `adaln_scale` | 2.0 | stronger global look | identity fades | | |
| | `ref_context_scale` | 0.01 | sharper fine-grained ID; can over-freeze | local detail blurs back to base | | |
| | `ref_token_scale` | 0.25 | more "voice" for the visual tokens in attention | `ref_attn` becomes a no-op | | |
| | `ref_start_block` / `ref_end_block` | 12 / 35 | (do not change) | (do not change); outside this range the LoRA is untrained | | |
| | `enable_role_embedding` | False | adds out-of-distribution bias to ref tokens | matches training | | |
| | `role_strength` | n/a | only matters if `enable_role_embedding=True` | | | |
| | Standard LoRA `strength_model` | 1.0 | over-fits to training distribution | drifts back toward base LTX-2 | | |
| The combination that mirrors training of the | |
| `..._ref_attn-ref_visual_proj` build exactly: `adaln_scale=2.0, | |
| ref_context_scale=0.01, ref_token_scale=0.25, ref_start_block=12, | |
| ref_end_block=35, enable_role_embedding=False, ref_init_from=attn2, | |
| strength_model=1.0`. | |
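To spot accidental drift from these values, a workflow's settings can be diffed against the training combination. The dictionary below copies the values listed here; the helper and the plain-dict representation of a workflow are assumptions for the sketch, not part of any node API.

```python
TRAINING_SETTINGS = {
    "adaln_scale": 2.0,
    "ref_context_scale": 0.01,
    "ref_token_scale": 0.25,
    "ref_start_block": 12,
    "ref_end_block": 35,
    "enable_role_embedding": False,
    "ref_init_from": "attn2",
    "strength_model": 1.0,
}


def deviations_from_training(settings):
    """Map each differing knob to (given value, training value)."""
    return {
        name: (settings[name], trained)
        for name, trained in TRAINING_SETTINGS.items()
        if name in settings and settings[name] != trained
    }


mine = dict(TRAINING_SETTINGS, adaln_scale=1.0, ref_start_block=0)
print(deviations_from_training(mine))
# {'adaln_scale': (1.0, 2.0), 'ref_start_block': (0, 12)}
```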
| --- | |
| ## 7. Where the loaded files come from | |
| `scripts/split_editanything_lora.py` produces two safetensors per checkpoint. | |
| The filename suffix lists every extra that ended up in the module sidecar | |
| (fixed order: `ref_adaln_proj`, `role_embedding`, `ref_attn`, | |
| `ref_visual_proj`), so you can tell which mechanisms each pair carries | |
| without opening the file. | |
| Canonical pairs: | |
| ``` | |
| edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors | |
| edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors | |
| edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors | |
| edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors | |
| ``` | |
| Feed the `.standard.*` into ComfyUI's standard LoRA loader and the | |
| `.module.*` into `LTXVEditAnythingModuleLoader`. Mixing pairs across builds | |
| (e.g., `ref_adaln_proj-role_embedding.standard.*` with | |
| `..._ref_attn-ref_visual_proj.module.*`) is not supported: the LoRA deltas | |
| were trained against the partner adapters in the same build. | |
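Because the suffix encodes the extras in a fixed order, a pair can be checked from the filenames alone. This is a sketch with hypothetical helper names; it only knows the prefix and extras listed in this document.

```python
PREFIX = "edit_anything_reference_v0.1_r128_"
EXTRAS_ORDER = ("ref_adaln_proj", "role_embedding", "ref_attn", "ref_visual_proj")


def parse_name(filename):
    """Split a filename into (extras tuple, 'standard' or 'module')."""
    stem, kind, ext = filename.rsplit(".", 2)
    if ext != "safetensors" or kind not in ("standard", "module") or not stem.startswith(PREFIX):
        raise ValueError(f"unexpected filename: {filename}")
    extras = tuple(stem[len(PREFIX):].split("-"))
    # every extra must be a known name, and they must appear in the fixed order
    if [e for e in EXTRAS_ORDER if e in extras] != list(extras):
        raise ValueError(f"unexpected extras suffix: {extras}")
    return extras, kind


def is_matched_pair(file_a, file_b):
    """True when the files are the standard/module halves of the same build."""
    extras_a, kind_a = parse_name(file_a)
    extras_b, kind_b = parse_name(file_b)
    return extras_a == extras_b and {kind_a, kind_b} == {"standard", "module"}


old = PREFIX + "ref_adaln_proj-role_embedding"
new = PREFIX + "ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj"
print(is_matched_pair(new + ".standard.safetensors", new + ".module.safetensors"))  # True
print(is_matched_pair(old + ".standard.safetensors", new + ".module.safetensors"))  # False
```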