---
license: apache-2.0
library_name: diffusers
base_model: Lightricks/LTX-2.3
tags:
- lora
- video
- video-editing
- ltx-2.3
---
# Edit Anything: Experimental LTX-2 Video Editing LoRAs

> **Heads up.** These LoRAs are research experiments. They are far from
> production-ready and will fail on many inputs. They are released for the
> community to play with and break, not as a finished tool.

This repository hosts three unrelated training tracks built on top of
**LTX-2.3 (22B)** for video editing:

1. **Edit Anything v0.1: motion transfer LoRA** (two ranks).
2. **Edit Anything: no-reference multitask LoRA** (rank 256, prompt-driven only).
3. **Reference video-to-video (Ref V2V): experimental IC-LoRA + sidecar modules** (two builds).

Inference is meant to run through the **BFSnodes** ComfyUI custom nodes.
The Ref V2V build in particular needs them to load the sidecar modules and
install the custom branches into the transformer.
---
## 1. Edit Anything v0.1 (motion transfer)
Files:
- `edit_anything_30k_v0.1_motion_transfer_r128.safetensors`
- `edit_anything_30k_v0.1_motion_transfer_r256.safetensors`
### What it is
**v0.1 is not a direct continuation of v1.0.** It was trained from scratch
in two stages:
1. **Stage 1: image-only pretraining.** ~30 000 image edit pairs. Training
a *video* model on still images is admittedly not ideal, but it was a way
to push the editing vocabulary beyond what a small video-only dataset can
teach.
2. **Stage 2: video fine-tune with `first_frame_conditioning > 0`.** This
restored the temporal prior and unlocked the motion-transfer behaviour
described below.
In theory v0.1 can do the same edits as v1.0, but **temporal consistency may
be weaker than v1.0** because so much of stage 1 happened on still images.
Test against v1.0 case-by-case before assuming v0.1 wins on your task.
### Motion transfer
Because stage 2 included first-frame conditioning, you can drive the LoRA
into a motion-transfer mode:
1. Take a guide video.
2. **Replace its first frame** with an edited still (insert a new subject,
swap an object, etc.). Use a strong image-editing model (**Flux Klein**
or similar) to prepare it; the quality of this single frame propagates
through the whole clip.
3. Feed the edited frame as the first frame of the input, and the original
guide video as the motion source.
The model uses the new first frame as the appearance anchor and copies the
motion from the rest of the guide.
Limitations (these are real, not theoretical; expect them to bite):
- **Hard scene cuts break it.** The model assumes continuous motion from
the first frame onwards. A cut to a different camera angle or location
mid-clip will produce smearing, ghosting, or the inserted subject jumping
to the wrong position. Use clips without cuts, or split at the cuts and
process each segment separately.
- **Very fast motion fails.** Quick pans, fast subject movement, or
high-velocity action confuse the motion-copy mechanism. Outputs degrade
to blur or to the model "freezing" on the first-frame appearance and
losing the motion entirely. Stick to moderate-speed clips.
- Poor blending / artefacts in the first frame propagate everywhere.
- Works best when the inserted subject roughly occupies the same region as
whatever it replaces.
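The advice to split at hard cuts can be sketched with a small helper. This is only an illustration, not part of BFSnodes: `split_at_cuts` and `cut_indices` are hypothetical names, and the cut positions are assumed to come from your own inspection of the clip or from a scene-detection tool of your choice.

```python
def split_at_cuts(num_frames, cut_indices):
    """Return (start, end) frame ranges, end exclusive, with no cut inside a range.

    cut_indices holds the index of the first frame AFTER each hard cut.
    """
    # Keep only cuts that fall strictly inside the clip, sorted and de-duplicated.
    cuts = sorted({c for c in cut_indices if 0 < c < num_frames})
    bounds = [0] + cuts + [num_frames]
    return list(zip(bounds[:-1], bounds[1:]))


segments = split_at_cuts(num_frames=120, cut_indices=[48, 90])
for start, end in segments:
    print(f"segment: frames {start}..{end - 1}")
```

Each segment would then be processed as its own motion-transfer run, which also means each one needs its own edited first frame.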
### Prompting
The prompt is just as critical as in v1.0. **Describe both the object being
replaced and the new one in detail**. Example: *"Replace the bronze statue on
the left with a tall man wearing a navy raincoat and brown boots."* Vague
prompts produce bad edits.
### Which rank to use
The same training produced both files. v0.1 is actually the merge of the
two-stage training (one LoRA per stage), re-extracted at two different ranks
via Frobenius-optimal truncated SVD:
| File | Rank | Size | Frobenius retention |
|---|---|---|---|
| `edit_anything_30k_v0.1_motion_transfer_r128.safetensors` | 128 | 1.31 GB | ~99.4% |
| `edit_anything_30k_v0.1_motion_transfer_r256.safetensors` | 256 | 2.62 GB | ~99.9% |
r256 is closer to the merged source. r128 is normally indistinguishable in
practice. Pick whichever fits your workflow.
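For readers unfamiliar with the re-extraction step, here is a minimal NumPy sketch of merging two low-rank updates and re-factoring the result with a truncated SVD. It is an assumption-laden illustration, not the script used for these files: `reextract_lora` is a hypothetical name, LoRA alpha scaling is ignored, and the retention metric shown (ratio of Frobenius norms) is one plausible reading of "Frobenius retention", which this README does not define.

```python
import numpy as np


def reextract_lora(stages, rank):
    """Merge per-stage LoRA factors and re-factor the result at `rank`.

    stages: list of (up, down) pairs; up is (out, r_i), down is (r_i, in).
    """
    # Merged weight delta: sum of each stage's low-rank product.
    delta = sum(up @ down for up, down in stages)
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Keeping the top-`rank` singular triplets gives the best rank-`rank`
    # approximation in Frobenius norm (Eckart-Young theorem).
    root = np.sqrt(s[:rank])
    new_up = u[:, :rank] * root            # shape (out, rank)
    new_down = root[:, None] * vt[:rank]   # shape (rank, in)
    # Assumed retention metric: ||delta_rank||_F / ||delta||_F.
    retention = float(np.sqrt(np.sum(s[:rank] ** 2) / np.sum(s ** 2)))
    return new_up, new_down, retention


rng = np.random.default_rng(0)
stages = [(rng.normal(size=(64, 8)), rng.normal(size=(8, 48))) for _ in range(2)]
for r in (4, 8, 16):
    _, _, kept = reextract_lora(stages, r)
    print(f"rank {r}: retention {kept:.4f}")
```

Retention can only grow with rank, which matches the table: the r256 file keeps slightly more of the merged update than r128, at twice the size.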
### How to wire the LoopingSampler
This is a **standard LoRA**, not a sidecar. Load it through the regular
ComfyUI LoraLoader **before** the LoopingSampler. On the sampler itself:
- `editanything_module` → **leave disconnected**.
- `ref_image` → the edited first frame (for motion transfer) **or** the
source frame you want preserved (for plain editing).
- `guide_frames` → the guide video.
- `enable_role_embedding`, `enable_adaln`, `enable_visual_crossattn` →
all **off**. None of those branches were trained for v0.1; turning them
on with no module connected does nothing anyway, but keeping them off
silences the WARN logs.
---
## 2. Edit Anything: no-reference multitask LoRA
File:
- `edit_anything_v1.1_r256.safetensors`
### What it is
A **prompt-only** multitask editing LoRA. No reference image, no first-frame
conditioning: the model is driven entirely by the text prompt and the
guide video. Trained on a balanced mix of **Add, Remove, Replace, Style**
edits.
### What's different about it (vs v0.1)
The task vocabulary overlaps heavily with v0.1: both can do Add, Remove,
Replace, Change, Convert. What changes here:
- **Two-stage training continuation**: the first stage gave the model its
edit vocabulary; the second stage refined it on a larger, more balanced
video pair set covering Add / Remove / Replace / Style.
- **Rank 256** (vs v0.1's effective rank from the merge), giving more
capacity for the broader task mix.
- Trained directly on video pairs, so the temporal behaviour on these
tasks tends to be steadier than on a model whose first stage was on
still images.
### How to use it
**Standalone.** Load it as a regular LoRA on vanilla LTX-2.3 through any
ComfyUI LoRA loader. The file already carries everything it needs; no
stacking with v0.1, no companion module.
### Limitations
- No reference image → identity is not anchored, so Add / Replace of a
specific person or object will be wobblier than the Ref V2V build.
- No motion transfer (that's v0.1 only).
### Prompting
Same imperative shape as v0.1, but the training set is split into four very
distinct caption styles. Match the one that fits the edit you want: the
distribution is narrow and the model expects the right shape.
The training set is roughly balanced across **Add, Remove, Replace and
Style** buckets, with Style being the smallest of the four. Captions
below are real examples drawn from those buckets.
#### Add: 15 to 30+ words, describe what to add and where
* `Add a smiling woman with brown hair, wearing a pink sleeveless top, sitting to the right of the man at the news desk.`
* `Add a person wearing a blue denim shirt over a white t-shirt to the right side of the frame, behind the person cooking.`
* `Add a decorated Christmas tree with red and white ornaments and lights to the right of the man.`
* `Add a blonde boy wearing a black t-shirt with a blue collar and blue patterned pants, sitting behind the other children in the upper center of the frame.`
* `Add two horizontal wooden strips to the front of the white range hood.`
Pattern: `Add <detailed subject description>, <position in frame>, <surrounding context>.`
#### Remove: very short, 4 to 10 words
* `Remove the man drinking from a glass.`
* `Remove the disco ball.`
* `Remove the large tree on the right.`
* `Remove the squirrel in the foreground.`
* `Remove the man on the left.`
Pattern: `Remove the <object>` (+ optional position). Resist the urge to
over-describe: long Remove prompts drift outside the training shape and
often fail.
#### Replace: 20 to 35 words, describe both old and new
* `Replace the white panel door on the right side of the frame with a dark brown grandfather clock.`
* `Replace the light-colored cat lying on the mat on the floor with a young woman sitting on the mat.`
* `Replace the dark grey knitted sweater on the man's torso with a black and white patterned Christmas sweater.`
* `Replace the blue robot with a glowing blue face on the left with a smiling man wearing sunglasses and a blue shirt.`
* `Replace the sitting person wearing a black cape on the left with a black fabric draped over an object.`
Pattern: `Replace <description of the original subject and its location> with <description of the new subject>.`
#### Style: fixed template, the style name is what changes
* `Convert the video into a Pencil Sketch style.`
* `Convert the video into a Watercolor Painting style.`
* `Convert the video into a Van Gogh style.`
* `Convert the video into a Play-Doh style.`
* `Convert the video into a Claymation style.`
* `Convert the video into a 3D Chibi style.`
* `Convert the video into a Ghibli style.`
* `Convert the video into a Pop Art style.`
* `Convert the video into an American Cartoon style.`
* `Convert the video into a Flat Vector Cartoon style.`
The training set covers **300+ distinct style names**. Many work; many do
not. The styles listed above are heavily represented in training. Use the exact
phrase `Convert the video into a <STYLE> style`; deviations from this
template degrade quality noticeably.
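The four caption patterns above can be captured as small prompt builders, which helps keep prompts inside the training shape. This is a sketch only: the function names are hypothetical, and the `a` / `an` choice in `style_prompt` is a simple heuristic inferred from the examples above (for instance "an American Cartoon style").

```python
def add_prompt(subject, position, context=""):
    """Add <detailed subject description>, <position in frame>, <surrounding context>."""
    parts = [f"Add {subject}", position]
    if context:
        parts.append(context)
    return ", ".join(parts) + "."


def remove_prompt(target, position=""):
    """Remove the <object> (+ optional position). Keep it short."""
    suffix = f" {position}" if position else ""
    return f"Remove the {target}{suffix}."


def replace_prompt(original, new):
    """Replace <original subject and its location> with <new subject>."""
    return f"Replace {original} with {new}."


def style_prompt(style):
    """Fixed Style template; only the style name changes."""
    article = "an" if style[:1].lower() in "aeiou" else "a"
    return f"Convert the video into {article} {style} style."


print(remove_prompt("large tree", "on the right"))
print(style_prompt("Watercolor Painting"))
```

Builders like these do not make a prompt good on their own; the level of detail in the subject descriptions still has to match the examples for each bucket.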
#### What it does *not* do
These are honest limits of the training distribution; don't expect them
to work just because the model is multitask:
- **No compositional prompts.** *"Add X and remove Y"*, *"Replace A with B
and add C"*, etc. are **not** in the training set. Captions combining
two action verbs are essentially absent (the only ones present are the
"Remove X and replace with Y" idiom, which is really a single Replace).
Pure multi-action edits will fall apart, so split them into separate runs.
- **No "change background" as a task.** Background is only used as a
*positional reference* ("in the background", "on the wall in the
background"). To swap the entire backdrop, phrase it as a **Replace**
on a concrete background element, e.g.
*"Replace the brick wall in the background with a forest at sunset"*.
Vague prompts like *"Change the background to a beach"* are
off-distribution and rarely work.
- **No global colour grade / lighting change.** Only the Style template
is trained as a global transform. Anything else global (LUT-style
adjustments, time-of-day swaps without a concrete object) is unreliable.
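A rough pre-flight check for the first two limits could look like the sketch below. It is a keyword heuristic under stated assumptions, not a tool shipped with this repository: `lint_prompt` and both regular expressions are hypothetical, and the check will miss paraphrases and can misfire on unusual wording.

```python
import re

ACTION_VERBS = re.compile(r"\b(add|remove|replace|convert)\b", re.IGNORECASE)
BACKGROUND_TASK = re.compile(r"\bchange\s+the\s+background\b", re.IGNORECASE)


def lint_prompt(prompt):
    """Return a list of warnings for prompts that look off-distribution."""
    warnings = []
    verbs = {m.group(1).lower() for m in ACTION_VERBS.finditer(prompt)}
    # "Remove X and replace with Y" is treated as a single Replace.
    if len(verbs) > 1 and verbs != {"remove", "replace"}:
        warnings.append("multiple actions: split into separate runs")
    if BACKGROUND_TASK.search(prompt):
        warnings.append("background change: rephrase as Replace on a concrete element")
    return warnings


for text in ("Remove the disco ball.", "Add a cat on the sofa and remove the lamp."):
    print(text, "->", lint_prompt(text) or "ok")
```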
### Which LoRA should I use?
| If you want… | Use |
|---|---|
| Motion transfer (edit first frame externally, model copies motion) | **v0.1 motion transfer** |
| Multi-task edits (add / remove / replace / style) driven only by prompt | **no-ref multitask r256** (standalone) |
| Strong identity transfer from a reference image (Add / Replace) | **Ref V2V** |
### How to wire the LoopingSampler
A single **standard LoRA**, no sidecar, no stacking. Load through one
ComfyUI LoraLoader before the LoopingSampler. On the sampler:
- `editanything_module` → **leave disconnected**.
- `ref_image` → **leave disconnected**. This LoRA has no reference-image
path; passing one will just pre-encode tokens that you do not want.
- `guide_frames` → the guide video.
- `enable_role_embedding`, `enable_adaln`, `enable_visual_crossattn` →
all **off** (no module = nothing to inject anyway).
---
## 3. Reference video-to-video (Ref V2V): experimental
Files (two builds of the same LoRA family; each ships as a `(.standard, .module)` pair):
- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors`
- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors`
- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors`
- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors`
### What it is
The goal is **add / replace using a reference image**, same vibe as Edit
Anything v1.0, but with an explicit image as the appearance source instead
of relying only on the prompt.
Trained on **~1600** Add / Replace video pairs. Reference-paired video
datasets are basically nonexistent, so the dataset had to be built from
scratch, which is why the sample count is small. **It often fails.** This
is fully experimental; thousands of training runs went into landing on this
LoRA layout, and it is still unclear how much it actually helps.
### Architecture: why this LoRA has "modules"
Trained as a conventional IC-LoRA, plus extra projection branches that try
to make the reference signal survive across layers:
- **`ref_visual_proj`**: projects the reference VAE latent into 32 visual
memory tokens.
- **`ref_attn`**: a dedicated cross-attention branch inside each
transformer block, reading those tokens.
- **`ref_adaln_proj`**: a global AdaLN bias derived from the reference
(palette / overall look).
- **`role_embedding`**: an experimental token bias inspired by some of
Kijai's tests; whether it actually helps is still unclear.

These extra weights are saved alongside the LoRA in a `.module.safetensors`
sidecar because they are **not standard LoRA adapters**; the regular
ComfyUI LoRA loader can't consume them, so they need a dedicated node.
### How to load
| File | What it is | Where it goes |
|---|---|---|
| `*.standard.safetensors` | LoRA on `attn1` / `attn2` / `ff` only | Standard ComfyUI LoRA loader |
| `*.module.safetensors` | `role_embedding`, `ref_adaln_proj`, `ref_visual_proj`, `ref_attn` LoRA adapters | `LTXVEditAnythingModuleLoader` (BFSnodes) |
Both files of a pair must be loaded **together**: the LoRA was trained
against the sidecar adapters and they only make sense as a unit. Do not mix
`.standard` from one build with `.module` from another.
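Since mixing builds is an easy mistake with filenames this long, a small check can confirm that a `.standard` file and a `.module` file belong together. The sketch below relies only on the naming convention of the files listed above; `module_for` and `is_matching_pair` are hypothetical helper names, not BFSnodes functions.

```python
from pathlib import Path

STANDARD_SUFFIX = ".standard.safetensors"
MODULE_SUFFIX = ".module.safetensors"


def module_for(standard_file):
    """Return the sidecar filename that belongs to a `.standard` LoRA file."""
    name = Path(standard_file).name
    if not name.endswith(STANDARD_SUFFIX):
        raise ValueError(f"not a .standard file: {name}")
    return name[: -len(STANDARD_SUFFIX)] + MODULE_SUFFIX


def is_matching_pair(standard_file, module_file):
    """True only when both files share the same build prefix."""
    return module_for(standard_file) == Path(module_file).name


two_extras = "edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding"
print(is_matching_pair(two_extras + STANDARD_SUFFIX, two_extras + MODULE_SUFFIX))
```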
The module file is consumed by the **`🅛🅣🅧 LTXV Edit Anything Looping
Sampler`** node, which was written specifically to:
1. Install the `ref_attn` cross-attention branch on every transformer block.
2. Inject the AdaLN / role / visual cross-attention conditioning at the
correct points in the model.
3. Sample long videos in overlapping chunks with the conditioning re-applied
per chunk.
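To make the third point concrete, here is one way overlapping chunk boundaries can be computed. It illustrates the general idea only: `chunk_ranges`, `chunk_len` and `overlap` are hypothetical names, and the sampler's real chunk sizes, overlap handling and blending are not documented in this README.

```python
def chunk_ranges(num_frames, chunk_len, overlap):
    """Return (start, end) ranges, end exclusive, that cover all frames.

    Consecutive chunks share `overlap` frames, except that the last chunk
    may be shorter than `chunk_len`.
    """
    if num_frames <= 0:
        return []
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    stride = chunk_len - overlap
    ranges = []
    start = 0
    while True:
        end = min(start + chunk_len, num_frames)
        ranges.append((start, end))
        if end == num_frames:
            break
        start += stride
    return ranges


for start, end in chunk_ranges(num_frames=10, chunk_len=4, overlap=1):
    print(f"chunk covers frames {start}..{end - 1}")
```

In a scheme like this, the reference conditioning would be re-applied once per range, which is the behaviour the list above describes.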
### How to wire the LoopingSampler
- Load the `*.standard.safetensors` through a normal ComfyUI LoraLoader
before the sampler.
- Load the `*.module.safetensors` through `LTXVEditAnythingModuleLoader`
and connect its `editanything_module` output to the sampler.
- On the sampler:
  - `editanything_module` → **the module loader output** (required).
  - `ref_image` → **the reference image** (required; this is what
    `Add` / `Replace` will insert).
  - `guide_frames` → the source video to edit.
  - `enable_adaln` → **on** (defaults match training).
  - `enable_visual_crossattn` → **on** for the 4-extras build; **off** for
    the 2-extras build (where it would be a no-op anyway).
  - `enable_role_embedding` → **off** for the 4-extras build (training
    config disabled it); **on** if you're loading the 2-extras build alone.

Without `ref_image`, AdaLN and the visual cross-attention are disabled
without raising an error; the sampler only logs a warning.
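The sampler settings for every checkpoint in this repository can be summarised in one lookup table, which makes it easier to spot a wrong toggle. The table below is transcribed from the wiring notes in this README; the dictionary keys, the `EXPECTED` name and the `wiring_mismatches` helper are hypothetical and exist only for this sketch. `guide_frames` is omitted because it is always connected.

```python
# For `editanything_module` and `ref_image`, True means "connected".
# For the `enable_*` entries, True means the toggle is on.
EXPECTED = {
    "v0.1 motion transfer": {
        "editanything_module": False, "ref_image": True,
        "enable_adaln": False, "enable_visual_crossattn": False,
        "enable_role_embedding": False,
    },
    "no-ref multitask": {
        "editanything_module": False, "ref_image": False,
        "enable_adaln": False, "enable_visual_crossattn": False,
        "enable_role_embedding": False,
    },
    "ref v2v 2-extras": {
        "editanything_module": True, "ref_image": True,
        "enable_adaln": True, "enable_visual_crossattn": False,
        "enable_role_embedding": True,
    },
    "ref v2v 4-extras": {
        "editanything_module": True, "ref_image": True,
        "enable_adaln": True, "enable_visual_crossattn": True,
        "enable_role_embedding": False,
    },
}


def wiring_mismatches(build, actual):
    """Return the sorted input names whose value differs from the README."""
    expected = EXPECTED[build]
    return sorted(k for k, v in expected.items() if actual.get(k, False) != v)


mine = dict(EXPECTED["ref v2v 2-extras"])  # settings copied from the 2-extras build
print(wiring_mismatches("ref v2v 4-extras", mine))
```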
### Which build to use
- **`ref_adaln_proj-role_embedding`** (the 2-extras build): the original
training. Only ships the two side-channel modules.
- **`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`** (the
4-extras build): the continuation. Adds the visual cross-attention branch
and its projector on top.

It is genuinely **not clear yet** whether the extra branches help over the
plain LoRA. Both builds are honest experiments. Try both, decide for your
own use case, and please share findings.
### Reading the layers
For anyone who wants to understand what each layer in the Ref V2V
checkpoint does:
- [`lora_layers_reference.md`](./lora_layers_reference.md): full tensor
inventory of both builds.
- [`lora_layers_impact.md`](./lora_layers_impact.md): what each branch
contributes at inference and which inference knob (`adaln_scale`,
`ref_context_scale`, `ref_token_scale`, `ref_start_block`,
`ref_end_block`, etc.) maps back to which training default.
---
## Prompt examples
The v0.1 and Ref V2V LoRAs were trained on very different caption styles.
Match the style of whichever LoRA you're using; straying outside the training
distribution is the fastest way to get garbage out.
### Edit Anything v0.1: standard editing
The stage-1 dataset uses short imperative captions describing one or two
edits. Use the same shape at inference. Examples drawn from the training
distribution:
- *"Replace the stone statue of a man on the left with a young woman in a
green dress."*
- *"Add a black labrador retriever sitting beside the woman on the bench."*
- *"Remove the teacher from the classroom."*
- *"Alter the cap's colour from modern black to deep maroon."*
- *"Replace the fresh citrus-green background with a wooden desk."*
- *"Add faint tire tracks across the snow behind the car."*
- *"Add a black statue, a blue camera, a cyan towel, a red guitar and a
pink backpack to the lakeside pier."*
Tips:
- Imperative verbs: **Add / Replace / Remove / Alter / Change**.
- When replacing, **describe both** the original and the new subject so the
model can localise the edit.
- Keep captions short and concrete. Long flowery prose hurts.
### Edit Anything v0.1: motion transfer
Workflow:
1. Pick a guide video.
2. Edit **only the first frame** externally (Flux Klein or any
capable image-edit model) to introduce the new subject in the desired
pose and position.
3. Feed the edited frame as the first frame of the input and the original
guide as motion source.
4. The prompt should describe **the inserted subject and the action being
preserved**.
Examples:
- *"Replace the standing man holding the umbrella with a woman in a red
coat holding the same umbrella, walking across the puddles."*
- *"Add a tabby cat curled up in the armchair while the man in the
background keeps reading."*
- *"Replace the runner in the blue jersey with a man wearing a white shirt
and grey shorts running along the same path."*
Limits: fast or chaotic motion will fail; the inserted subject should
occupy roughly the same region/scale as what it replaces.
### Reference V2V (Ref V2V): Add and Replace
These captions are real samples from the ~1600-pair training set. They
describe the **target scene after the edit** in detail. The reference
image carries the *appearance* of the inserted subject; the caption
carries *position, pose, action, and surrounding context*.
**Add task** (the reference image holds the new subject):
- *"Add a middle-aged man with curly grey hair, a beard and glasses,
wearing a blue quarter-zip sweater, on the right side of the frame,
standing in front of a raw cut of meat on a tray."*
- *"Add a light-coloured small boat with dark seats and an outboard motor
floating in the water."*
- *"Add an open book filled with colourful pencils in the woman's hands."*
- *"Add a silver metallic bucket on the table in front of the blonde
character, with her hands stirring a mixture inside."*
- *"Add two miniature dolls, one blonde and one brunette, dressed in
patterned clothing, sitting at a small table with teacups and small
white vases on the countertop."*
**Replace task** (the reference image holds the new subject; the caption
also describes what is being replaced):
- *"Replace the standing kangaroo holding the bicycle handlebars with a
man wearing a white t-shirt, light brown shorts and a yellow cap,
holding the bicycle handlebars."*
- *"Replace the stone statue of a man on the left side with a young woman
in a green dress."*
- *"Replace the wooden barrel near the entrance with a large brown leather
suitcase."*
Tips for Ref V2V:
- **Describe the inserted subject in full**, even though the reference
image is the source of truth: the text path drives placement and pose.
- For *Replace*, **also describe what is being replaced** so the model can
match the spatial region.
- Keep the inserted subject roughly in the same scale and region as what
it replaces.
- The captions in the training set average ~25–40 words; aim for that
range. Few-word captions like *"Add a man"* are far too sparse
and will fail.
---
## Inference tips (applies to all models)
**CFG matters a lot here.** The default workflow runs with the LTX-2.3
**distilled / acceleration LoRAs** for fast 4–8 step sampling, which
locks **CFG = 1.0**. That's fine for casual runs, but at CFG 1 the model
follows the prompt loosely: you get the reference image to "show up"
but the edit instruction itself is only weakly enforced.

**For harder prompts, raise CFG above 1.0.** This means dropping (or
weakening) the distilled / acceleration LoRAs and going back to a normal
sampler with more steps. That is significantly slower, but the model
follows the prompt much more closely: you trade speed for prompt adherence.

Other knobs:
- If the model is **ignoring the prompt** (edit isn't being applied, the
reference is barely showing up, the style transfer is faint), raising
CFG is the single most common fix. Go up to 6–8 if needed.
- If results look **over-saturated, plasticky, or motion is freezing**,
CFG is too high. Pull back toward 3–4, or re-enable the distilled LoRA
for CFG 1 if you don't actually need stronger prompt adherence.
- Ref V2V in particular benefits from being more aggressive with CFG when
the reference identity isn't transferring cleanly.
- Combine CFG tuning with the LoRA-specific knobs from each section
(`adaln_scale`, `ref_context_scale`, `ref_token_scale` for Ref V2V;
prompt rewriting for v0.1 / no-ref).

Treat CFG as a real knob, not a constant, and be ready to give up some
speed when you actually need the edit to land.
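For intuition about why CFG 1 enforces the prompt only weakly, the textbook classifier-free guidance formula is sketched below. This assumes plain CFG; the actual LTX-2.3 workflow and the BFSnodes sampler may combine predictions differently, and `cfg_combine` is a hypothetical name used only here.

```python
def cfg_combine(cond, uncond, scale):
    """Textbook classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]    # toy prediction with the edit prompt
uncond = [0.0, 1.0]  # toy prediction with an empty / negative prompt

print(cfg_combine(cond, uncond, 1.0))  # same values as the conditional prediction
print(cfg_combine(cond, uncond, 3.0))  # pushed further away from the unconditional one
```

At scale 1 the unconditional term cancels out, so nothing amplifies the difference the prompt makes. Higher scales amplify that difference, which needs a second model evaluation per step and is one reason higher-CFG sampling is slower.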
---
## ComfyUI nodes
All recommended inference paths run through the **BFSnodes** custom node
set. For now BFSnodes is the only place these nodes live; once they
stabilise they may move elsewhere.
Specific nodes used by these LoRAs:
- `LTXVEditAnythingApply`: load the LoRA + extras and patch the model.
- `🅛🅣🅧 LTXV Edit Anything Looping Sampler`: sampler that injects role /
AdaLN / visual cross-attention and handles long videos in chunks.
- `LTXVEditAnythingModuleLoader`: load the `*.module.safetensors` sidecar.
---
## Status
Released as experimental research artefacts. Expect failures, do not
deploy, and please report what works and what doesn't.
---
## Credits
If you use these models (in a project, a demo, a paper, a video, a tweet,
a workflow, anything), **please credit my work**. These checkpoints are the
result of weeks of research, dataset building, and training runs, and that
effort is what makes any of it usable. Crediting the source is the bare
minimum that keeps open research like this sustainable.
**Author:** Alisson Pereira dos Anjos ([@Alissonerdx](https://huggingface.co/Alissonerdx))
Suggested attribution:
> Edit Anything LoRAs by Alisson Pereira dos Anjos
> ([huggingface.co/Alissonerdx/EditAnything](https://huggingface.co/Alissonerdx/EditAnything)).
Links back to this repository are appreciated wherever you publish results.