Edit Anything: Experimental LTX-2 Video Editing LoRAs
Heads up. These LoRAs are research experiments. They are far from production-ready and will fail on many inputs. They are released for the community to play with and break, not as a finished tool.
This repository hosts three unrelated training tracks built on top of LTX-2.3 (22B) for video editing:
- Edit Anything v0.1: motion transfer LoRA (two ranks).
- Edit Anything: no-reference multitask LoRA (rank 256, prompt-driven only).
- Reference video-to-video (Ref V2V): experimental IC-LoRA + sidecar modules (two builds).
Inference is meant to run through the BFSnodes ComfyUI custom nodes. The Ref V2V build in particular needs them to load the sidecar modules and install the custom branches into the transformer.
1. Edit Anything v0.1 (motion transfer)
Files:
- edit_anything_30k_v0.1_motion_transfer_r128.safetensors
- edit_anything_30k_v0.1_motion_transfer_r256.safetensors
What it is
v0.1 is not a direct continuation of v1.0. It was trained from scratch in two stages:
- Stage 1: image-only pretraining. ~30 000 image edit pairs. Training a video model on still images is admittedly not ideal, but it was a way to push the editing vocabulary beyond what a small video-only dataset can teach.
- Stage 2: video fine-tune with first_frame_conditioning > 0. This restored the temporal prior and unlocked the motion-transfer behaviour described below.
In theory v0.1 can do the same edits as v1.0, but temporal consistency may be weaker than v1.0 because so much of stage 1 happened on still images. Test against v1.0 case-by-case before assuming v0.1 wins on your task.
Motion transfer
Because stage 2 included first-frame conditioning, you can drive the LoRA into a motion-transfer mode:
- Take a guide video.
- Replace its first frame with an edited still (insert a new subject, swap an object, etc.). Use a strong image-editing model (Flux Klein or similar) to prepare it; the quality of this single frame propagates through the whole clip.
- Feed the edited frame as the first frame of the input, and the original guide video as the motion source.
The model uses the new first frame as the appearance anchor and copies the motion from the rest of the guide.
Limitations (these are real, not theoretical; expect them to bite):
- Hard scene cuts break it. The model assumes continuous motion from the first frame onwards. A cut to a different camera angle or location mid-clip will produce smearing, ghosting, or the inserted subject jumping to the wrong position. Use clips without cuts, or split at the cuts and process each segment separately.
- Very fast motion fails. Quick pans, fast subject movement, or high-velocity action confuse the motion-copy mechanism. Outputs degrade to blur or to the model "freezing" on the first-frame appearance and losing the motion entirely. Stick to moderate-speed clips.
- Poor blending / artefacts in the first frame propagate everywhere.
- Works best when the inserted subject roughly occupies the same region as whatever it replaces.
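The "split at the cuts" workaround above can be sketched as a small helper. This is only an illustration: `split_at_cuts` is a hypothetical name, the cut positions are assumed to come from whatever scene-detection tool you already use, and each resulting segment would presumably need its own edited first frame.

```python
def split_at_cuts(num_frames, cut_frames):
    """Return (start, end) frame ranges, end exclusive, one per continuous shot.

    cut_frames holds the index of the first frame of each new shot.
    """
    # Keep only cuts strictly inside the clip, sorted and de-duplicated.
    cuts = sorted({c for c in cut_frames if 0 < c < num_frames})
    bounds = [0] + cuts + [num_frames]
    return list(zip(bounds, bounds[1:]))


# Hypothetical clip: 240 frames with cuts detected at frames 97 and 180.
for start, end in split_at_cuts(num_frames=240, cut_frames=[97, 180]):
    print(f"process frames {start}..{end - 1} as one motion-transfer run")
```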
Prompting
The prompt is just as critical as in v1.0. Describe both the object being replaced and the new one in detail. Example: "Replace the bronze statue on the left with a tall man wearing a navy raincoat and brown boots." Vague prompts produce bad edits.
Which rank to use
The same training produced both files. v0.1 is actually the merge of the two-stage training (one LoRA per stage), re-extracted at two different ranks via Frobenius-optimal truncated SVD:
| File | Rank | Size | Frobenius retention |
|---|---|---|---|
| edit_anything_30k_v0.1_motion_transfer_r128.safetensors | 128 | 1.31 GB | ~99.4% |
| edit_anything_30k_v0.1_motion_transfer_r256.safetensors | 256 | 2.62 GB | ~99.9% |
r256 is closer to the merged source. r128 is normally indistinguishable in practice. Pick whichever fits your workflow.
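For readers unfamiliar with the re-extraction step, here is a minimal NumPy sketch of merging two LoRAs for one weight matrix and truncating the result by SVD. The shapes, ranks, and random weights are toy assumptions, and "retention" is computed here as the fraction of squared Frobenius norm kept, which may not be exactly how the table above was computed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes (assumed): one weight matrix and two stage LoRAs of rank 8 each.
out_dim, in_dim = 64, 48
up1, down1 = rng.normal(size=(out_dim, 8)), rng.normal(size=(8, in_dim))
up2, down2 = rng.normal(size=(out_dim, 8)), rng.normal(size=(8, in_dim))

# Merge: sum the two low-rank updates into one dense delta (rank <= 16).
delta = up1 @ down1 + up2 @ down2


def reextract(delta, rank):
    """Best rank-`rank` approximation of delta in Frobenius norm (Eckart-Young)."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    up = u[:, :rank] * s[:rank]   # fold the singular values into the up matrix
    down = vt[:rank, :]
    # Assumed definition: share of squared Frobenius norm that survives.
    retention = float(np.sum(s[:rank] ** 2) / np.sum(s ** 2))
    return up, down, retention


for rank in (4, 8, 16):
    _, _, retention = reextract(delta, rank)
    print(f"rank {rank}: retention {retention:.4f}")
```

Because the merged delta of two rank-8 updates has rank at most 16, the rank-16 extraction reproduces it up to floating-point error, while lower ranks discard the smallest singular directions first.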
How to wire the LoopingSampler
This is a standard LoRA, not a sidecar. Load it through the regular ComfyUI LoraLoader before the LoopingSampler. On the sampler itself:
- editanything_module: leave disconnected.
- ref_image: the edited first frame (for motion transfer) or the source frame you want preserved (for plain editing).
- guide_frames: the guide video.
- enable_role_embedding, enable_adaln, enable_visual_crossattn: all off. None of those branches were trained for v0.1; turning them on with no module connected does nothing anyway, but keeping them off silences the WARN logs.
2. Edit Anything: no-reference multitask LoRA
File:
edit_anything_v1.1_r256.safetensors
What it is
A prompt-only multitask editing LoRA. No reference image, no first-frame conditioning: the model is driven entirely by the text prompt and the guide video. Trained on a balanced mix of Add, Remove, Replace, Style edits.
What's different about it (vs v0.1)
The task vocabulary overlaps heavily with v0.1; both can do Add, Remove, Replace, Change, Convert. What changes here:
- Two-stage training continuation: the first stage gave the model its edit vocabulary; the second stage refined it on a larger, more balanced video pair set covering Add / Remove / Replace / Style.
- Rank 256 (vs v0.1's effective rank from the merge), giving more capacity for the broader task mix.
- Trained directly on video pairs, so the temporal behaviour on these tasks tends to be steadier than on a model whose first stage was on still images.
How to use it
Standalone: load it as a regular LoRA on vanilla LTX-2.3 through any ComfyUI LoRA loader. The file already carries everything it needs; no stacking with v0.1, no companion module.
Limitations
- No reference image: identity is not anchored, so Add / Replace of a specific person or object will be wobblier than the Ref V2V build.
- No motion transfer (that's v0.1 only).
Prompting
Same imperative shape as v0.1, but the training set splits into four very distinct caption styles. Match the one that fits the edit you want; the distribution is narrow and the model expects the right shape.
The training set is roughly balanced across Add, Remove, Replace and Style buckets, with Style being the smallest of the four. Captions below are real examples drawn from those buckets.
Add: 15 to 30+ words, describe what to add and where
- Add a smiling woman with brown hair, wearing a pink sleeveless top, sitting to the right of the man at the news desk.
- Add a person wearing a blue denim shirt over a white t-shirt to the right side of the frame, behind the person cooking.
- Add a decorated Christmas tree with red and white ornaments and lights to the right of the man.
- Add a blonde boy wearing a black t-shirt with a blue collar and blue patterned pants, sitting behind the other children in the upper center of the frame.
- Add two horizontal wooden strips to the front of the white range hood.
Pattern: Add <detailed subject description>, <position in frame>, <surrounding context>.
Remove: very short, 4 to 10 words
- Remove the man drinking from a glass.
- Remove the disco ball.
- Remove the large tree on the right.
- Remove the squirrel in the foreground.
- Remove the man on the left.
Pattern: Remove the <object> (+ optional position). Resist the urge to over-describe; long Remove prompts drift outside the training shape and often fail.
Replace: 20 to 35 words, describe both old and new
- Replace the white panel door on the right side of the frame with a dark brown grandfather clock.
- Replace the light-colored cat lying on the mat on the floor with a young woman sitting on the mat.
- Replace the dark grey knitted sweater on the man's torso with a black and white patterned Christmas sweater.
- Replace the blue robot with a glowing blue face on the left with a smiling man wearing sunglasses and a blue shirt.
- Replace the sitting person wearing a black cape on the left with a black fabric draped over an object.
Pattern: Replace <description of the original subject and its location> with <description of the new subject>.
Style: fixed template, the style name is what changes
- Convert the video into a Pencil Sketch style.
- Convert the video into a Watercolor Painting style.
- Convert the video into a Van Gogh style.
- Convert the video into a Play-Doh style.
- Convert the video into a Claymation style.
- Convert the video into a 3D Chibi style.
- Convert the video into a Ghibli style.
- Convert the video into a Pop Art style.
- Convert the video into an American Cartoon style.
- Convert the video into a Flat Vector Cartoon style.
The training set covers 300+ distinct style names. Many work; many do not. The list above is heavily represented in training. Use the exact phrase "Convert the video into a <STYLE> style"; deviations from this template degrade quality noticeably.
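The four caption shapes above can be turned into small prompt builders. This is a convenience sketch, not part of BFSnodes or any library: all function names are made up here, and the word ranges are the rough figures from the headings above, so treat a failed check as a hint rather than a hard rule.

```python
# Rough word ranges from the headings above; None means no upper bound given.
WORD_RANGES = {"add": (15, None), "remove": (4, 10), "replace": (20, 35)}


def add_prompt(subject, position, context=""):
    parts = [f"Add {subject}", position, context]
    return ", ".join(p for p in parts if p) + "."


def remove_prompt(obj, position=""):
    return f"Remove the {obj} {position}".strip() + "."


def replace_prompt(original, new):
    return f"Replace {original} with {new}."


def style_prompt(style):
    # Fixed template; only the article and the style name change.
    article = "an" if style[:1].lower() in "aeiou" else "a"
    return f"Convert the video into {article} {style} style."


def word_count_ok(task, prompt):
    """True if the prompt length falls inside the rough range for its task."""
    low, high = WORD_RANGES[task]
    n = len(prompt.split())
    return n >= low and (high is None or n <= high)


prompt = remove_prompt("large tree", "on the right")
print(prompt, word_count_ok("remove", prompt))
print(style_prompt("American Cartoon"))
```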
What it does not do
These are honest limits of the training distribution; don't expect them to work just because the model is multitask:
- No compositional prompts. "Add X and remove Y", "Replace A with B and add C", etc. are not in the training set. Captions combining two action verbs are essentially absent (the only ones present are the "Remove X and replace with Y" idiom, which is really a single Replace). Pure multi-action edits will fall apart; split them into separate runs.
- No "change background" as a task. Background is only used as a positional reference ("in the background", "on the wall in the background"). To swap the entire backdrop, phrase it as a Replace on a concrete background element, e.g. "Replace the brick wall in the background with a forest at sunset". Vague prompts like "Change the background to a beach" are off-distribution and rarely work.
- No global colour grade / lighting change. Only the Style template is trained as a global transform. Anything else global (LUT-style adjustments, time-of-day swaps without a concrete object) is unreliable.
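A crude pre-flight check for the limits listed above might look like the following. It is a keyword heuristic only (the function name and warning strings are invented for this sketch), so it will miss some off-distribution prompts and may flag some valid ones.

```python
import re

ACTION_VERBS = ("add", "remove", "replace", "convert")
STYLE_TEMPLATE = re.compile(r"^Convert the video into an? .+ style\.$")


def lint_prompt(prompt):
    """Return a list of warnings; an empty list means no known red flag."""
    warnings = []
    words = re.findall(r"[a-z]+", prompt.lower())
    verbs = [w for w in words if w in ACTION_VERBS]

    # "Remove X and replace with Y" is the one two-verb idiom seen in training.
    is_replace_idiom = verbs == ["remove", "replace"]
    if len(verbs) > 1 and not is_replace_idiom:
        warnings.append("multiple action verbs: split into separate runs")

    if re.match(r"\s*change the background\b", prompt, re.IGNORECASE):
        warnings.append("no 'change background' task: use Replace on a concrete element")

    if words[:1] == ["convert"] and not STYLE_TEMPLATE.match(prompt.strip()):
        warnings.append("Style prompt deviates from the fixed template")

    return warnings


print(lint_prompt("Add a red car and remove the tree."))
print(lint_prompt("Remove the disco ball."))
```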
Which LoRA should I use?
| If you want... | Use |
|---|---|
| Motion transfer (edit first frame externally, model copies motion) | v0.1 motion transfer |
| Multi-task edits (add / remove / replace / style) driven only by prompt | no-ref multitask r256 (standalone) |
| Strong identity transfer from a reference image (Add / Replace) | Ref V2V |
How to wire the LoopingSampler
A single standard LoRA, no sidecar, no stacking. Load through one ComfyUI LoraLoader before the LoopingSampler. On the sampler:
- editanything_module: leave disconnected.
- ref_image: leave disconnected. This LoRA has no reference-image path; passing one will just pre-encode tokens that you do not want.
- guide_frames: the guide video.
- enable_role_embedding, enable_adaln, enable_visual_crossattn: all off (no module = nothing to inject anyway).
3. Reference video-to-video (Ref V2V): experimental
Files (two builds of the same LoRA family; each ships as a (.standard, .module) pair):
- edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors
- edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors
- edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors
- edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors
What it is
The goal is add / replace using a reference image: same vibe as Edit Anything v1.0, but with an explicit image as the appearance source instead of relying only on the prompt.
Trained on ~1600 Add / Replace video pairs. Reference-paired video datasets are basically nonexistent, so the dataset had to be built from scratch, which is why the sample count is small. It often fails. This is fully experimental; thousands of training runs went into landing on this LoRA layout, and it is still unclear how much it actually helps.
Architecture: why this LoRA has "modules"
Trained as a conventional IC-LoRA, plus extra projection branches that try to make the reference signal survive across layers:
- ref_visual_proj: projects the reference VAE latent into 32 visual memory tokens.
- ref_attn: a dedicated cross-attention branch inside each transformer block, reading those tokens.
- ref_adaln_proj: a global AdaLN bias derived from the reference (palette / overall look).
- role_embedding: an experimental token bias inspired by some of Kijai's tests; whether it actually helps is still unclear.
These extra weights are saved alongside the LoRA in a .module.safetensors sidecar because they are not standard LoRA adapters. The regular ComfyUI LoRA loader can't consume them, so they need a dedicated node.
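To make the data flow of these branches easier to picture, here is a toy NumPy sketch. Only the branch names and the count of 32 memory tokens come from the description above; the pooling scheme, the dimensions, the random weights, and the way the AdaLN bias is derived are all assumptions for illustration and do not reflect the real checkpoint layout.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_mem, n_video, n_ref = 64, 32, 120, 50  # all hypothetical except n_mem = 32


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


ref_latent = rng.normal(size=(n_ref, dim))   # stand-in for reference VAE latent tokens
hidden = rng.normal(size=(n_video, dim))     # stand-in for video tokens in one block

# ref_visual_proj (assumed design): learned queries pool the reference
# latent into 32 visual memory tokens.
mem_queries = rng.normal(size=(n_mem, dim))
memory = softmax(mem_queries @ ref_latent.T / np.sqrt(dim)) @ ref_latent

# ref_attn (assumed design): video tokens cross-attend to the memory tokens
# and the result is added back residually.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
attn = softmax((hidden @ w_q) @ (memory @ w_k).T / np.sqrt(dim))
hidden = hidden + attn @ (memory @ w_v)

# ref_adaln_proj (assumed design): one global vector summarising the
# reference, to be added as a bias to the block's AdaLN modulation.
w_adaln = rng.normal(size=(dim, dim)) / np.sqrt(dim)
adaln_bias = ref_latent.mean(axis=0) @ w_adaln

print(memory.shape, attn.shape, hidden.shape, adaln_bias.shape)
```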
How to load
| File | What it is | Where it goes |
|---|---|---|
| *.standard.safetensors | LoRA on attn1 / attn2 / ff only | Standard ComfyUI LoRA loader |
| *.module.safetensors | role_embedding, ref_adaln_proj, ref_visual_proj, ref_attn LoRA adapters | LTXVEditAnythingModuleLoader (BFSnodes) |
Both files of a pair must be loaded together: the LoRA was trained against the sidecar adapters and they only make sense as a unit. Do not mix .standard from one build with .module from another.
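Since the two files of a pair share everything but the suffix, a filename check can catch a mixed pair before anything is loaded. The helper names below are hypothetical; the check relies only on the naming convention of the files listed above.

```python
SUFFIXES = (".standard.safetensors", ".module.safetensors")


def build_id(filename):
    """Strip the pair suffix and return the shared build name."""
    for suffix in SUFFIXES:
        if filename.endswith(suffix):
            return filename[: -len(suffix)]
    raise ValueError(f"not a Ref V2V pair file: {filename}")


def check_pair(standard_file, module_file):
    """Raise if the two files are not the .standard / .module halves of one build."""
    if not standard_file.endswith(SUFFIXES[0]):
        raise ValueError("first file must be the .standard half")
    if not module_file.endswith(SUFFIXES[1]):
        raise ValueError("second file must be the .module half")
    if build_id(standard_file) != build_id(module_file):
        raise ValueError("mixed builds: .standard and .module come from different pairs")
    return build_id(standard_file)


base = "edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding"
print(check_pair(base + ".standard.safetensors", base + ".module.safetensors"))
```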
The module file is consumed by the LTXV Edit Anything Looping Sampler node, which was written specifically to:
- Install the ref_attn cross-attention branch on every transformer block.
- Inject the AdaLN / role / visual cross-attention conditioning at the correct points in the model.
- Sample long videos in overlapping chunks with the conditioning re-applied per chunk.
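The overlapping-chunk idea in the last bullet can be illustrated with plain index arithmetic. The chunk length and overlap used below are made-up values, and how the real sampler blends the overlapping frames is not described here, so this only shows which frame ranges each pass would cover.

```python
def chunk_ranges(num_frames, chunk_len, overlap):
    """Return (start, end) ranges, end exclusive, covering num_frames with overlap."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be in [0, chunk_len)")
    stride = chunk_len - overlap
    ranges, start = [], 0
    while True:
        end = min(start + chunk_len, num_frames)
        ranges.append((start, end))
        if end == num_frames:
            return ranges
        start += stride


# Hypothetical settings: 200 frames, 97-frame chunks, 16 frames of overlap.
for start, end in chunk_ranges(200, chunk_len=97, overlap=16):
    print(f"chunk covers frames {start}..{end - 1}; conditioning is re-applied here")
```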
How to wire the LoopingSampler
- Load the *.standard.safetensors through a normal ComfyUI LoraLoader before the sampler.
- Load the *.module.safetensors through LTXVEditAnythingModuleLoader and connect its editanything_module output to the sampler.
- On the sampler:
  - editanything_module: the module loader output (required).
  - ref_image: the reference image (required; this is what Add / Replace will insert).
  - guide_frames: the source video to edit.
  - enable_adaln: on (defaults match training).
  - enable_visual_crossattn: on for the 4-extras build; off (or will be a no-op) for the 2-extras build.
  - enable_role_embedding: off for the 4-extras build (training config disabled it). On if you're loading the 2-extras build alone.
A missing ref_image does not raise an error: it disables AdaLN and the visual cross-attention, and the sampler only warns in the log.
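The per-build toggles above can be written down as a lookup keyed on the module filename. This is a note-taking sketch, not a ComfyUI API: the dictionary keys mirror the sampler input names, while the function itself is hypothetical.

```python
def sampler_flags(module_filename):
    """Suggested LoopingSampler toggles for a Ref V2V .module.safetensors file."""
    # The 4-extras build is the one whose name includes both visual branches.
    has_visual = "ref_attn" in module_filename and "ref_visual_proj" in module_filename
    return {
        "enable_adaln": True,                  # on for both builds
        "enable_visual_crossattn": has_visual, # a no-op without the visual branch
        "enable_role_embedding": not has_visual,
    }


two_extras = "edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors"
four_extras = (
    "edit_anything_reference_v0.1_r128_"
    "ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors"
)
print(sampler_flags(two_extras))
print(sampler_flags(four_extras))
```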
Which build to use
- ref_adaln_proj-role_embedding: the original training. Only ships the two side-channel modules.
- ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj: the continuation. Adds the visual cross-attention branch and its projector on top.
It is genuinely not clear yet whether the extra branches help over the plain LoRA. Both builds are honest experiments. Try both, decide for your own use case, and please share findings.
Reading the layers
For anyone who wants to understand what each layer in the Ref V2V checkpoint does:
- lora_layers_reference.md: full tensor inventory of both builds.
- lora_layers_impact.md: what each branch contributes at inference and which inference knob (adaln_scale, ref_context_scale, ref_token_scale, ref_start_block, ref_end_block, etc.) maps back to which training default.
Prompt examples
The two LoRAs were trained on very different caption styles. Match the style of whichever LoRA you're using; straying outside the training distribution is the fastest way to get garbage out.
Edit Anything v0.1: standard editing
The stage-1 dataset uses short imperative captions describing one or two edits. Use the same shape at inference. Examples drawn from the training distribution:
- "Replace the stone statue of a man on the left with a young woman in a green dress."
- "Add a black labrador retriever sitting beside the woman on the bench."
- "Remove the teacher from the classroom."
- "Alter the cap's colour from modern black to deep maroon."
- "Replace the fresh citrus-green background with a wooden desk."
- "Add faint tire tracks across the snow behind the car."
- "Add a black statue, a blue camera, a cyan towel, a red guitar and a pink backpack to the lakeside pier."
Tips:
- Imperative verbs: Add / Replace / Remove / Alter / Change.
- When replacing, describe both the original and the new subject so the model can localise the edit.
- Keep captions short and concrete. Long flowery prose hurts.
Edit Anything v0.1: motion transfer
Workflow:
- Pick a guide video.
- Edit only the first frame externally (Flux Klein or any capable image-edit model) to introduce the new subject in the desired pose and position.
- Feed the edited frame as the first frame of the input and the original guide as motion source.
- The prompt should describe the inserted subject and the action being preserved.
Examples:
- "Replace the standing man holding the umbrella with a woman in a red coat holding the same umbrella, walking across the puddles."
- "Add a tabby cat curled up in the armchair while the man in the background keeps reading."
- "Replace the runner in the blue jersey with a man wearing a white shirt and grey shorts running along the same path."
Limits: fast or chaotic motion will fail; the inserted subject should occupy roughly the same region/scale as what it replaces.
Reference V2V (Ref V2V): Add and Replace
These captions are real samples from the ~1600-pair training set. They describe the target scene after the edit in detail. The reference image carries the appearance of the inserted subject; the caption carries position, pose, action, and surrounding context.
Add task (the reference image holds the new subject):
- "Add a middle-aged man with curly grey hair, a beard and glasses, wearing a blue quarter-zip sweater, on the right side of the frame, standing in front of a raw cut of meat on a tray."
- "Add a light-coloured small boat with dark seats and an outboard motor floating in the water."
- "Add an open book filled with colourful pencils in the woman's hands."
- "Add a silver metallic bucket on the table in front of the blonde character, with her hands stirring a mixture inside."
- "Add two miniature dolls, one blonde and one brunette, dressed in patterned clothing, sitting at a small table with teacups and small white vases on the countertop."
Replace task (the reference image holds the new subject; the caption also describes what is being replaced):
- "Replace the standing kangaroo holding the bicycle handlebars with a man wearing a white t-shirt, light brown shorts and a yellow cap, holding the bicycle handlebars."
- "Replace the stone statue of a man on the left side with a young woman in a green dress."
- "Replace the wooden barrel near the entrance with a large brown leather suitcase."
Tips for Ref V2V:
- Describe the inserted subject in full, even though the reference image is the source of truth; the text path drives placement and pose.
- For Replace, also describe what is being replaced so the model can match the spatial region.
- Keep the inserted subject roughly in the same scale and region as what it replaces.
- The captions in the training set average ~25-40 words; aim for that range. Single-sentence captions like "Add a man" are far too sparse and will fail.
Inference tips (applies to all models)
CFG matters a lot here. The default workflow runs with the LTX-2.3 distilled / acceleration LoRAs for fast 4-8 step sampling, which locks CFG = 1.0. That's fine for casual runs, but at CFG 1 the model follows the prompt loosely: you get the reference image to "show up" but the edit instruction itself is only weakly enforced.
For harder prompts, raise CFG above 1.0. This means dropping (or weakening) the distilled / acceleration LoRAs and going back to a normal sampler with more steps. That is significantly slower, but the model follows the prompt much more closely.
Other knobs:
- If the model is ignoring the prompt (edit isn't being applied, the reference is barely showing up, the style transfer is faint), raising CFG is the single most common fix. Go up to 6-8 if needed.
- If results look over-saturated, plasticky, or motion is freezing, CFG is too high; pull back toward 3-4 or re-enable the distilled LoRA for CFG 1 if you don't actually need stronger prompt adherence.
- Ref V2V in particular benefits from being more aggressive with CFG when the reference identity isn't transferring cleanly.
- Combine CFG tuning with the LoRA-specific knobs from each section (adaln_scale, ref_context_scale, ref_token_scale for Ref V2V; prompt rewriting for v0.1 / no-ref).
Treat CFG as a real knob, not a constant, and be ready to give up some speed when you actually need the edit to land.
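For intuition on why CFG 1 follows the prompt only loosely, here is the textbook classifier-free guidance combination on plain Python lists. It assumes the standard formulation; the actual LTX-2.3 sampler may add rescaling or other details not shown here, and `cfg_combine` is a name made up for this sketch.

```python
def cfg_combine(uncond, cond, cfg):
    """Classifier-free guidance: push the prediction from uncond toward cond."""
    return [u + cfg * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]   # toy prediction without the prompt
cond = [1.0, 3.0]     # toy prediction with the prompt

print(cfg_combine(uncond, cond, 1.0))  # equals cond: no extra push from the prompt
print(cfg_combine(uncond, cond, 3.0))  # extrapolates past cond, amplifying the prompt
```

At CFG 1 the result is just the conditional prediction, so the unconditional pass contributes nothing; higher values extrapolate further along the prompt direction, which is also where over-saturation can start.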
ComfyUI nodes
All recommended inference paths run through the BFSnodes custom node set. For now BFSnodes is the only place these nodes live; once they stabilise they may move elsewhere.
Specific nodes used by these LoRAs:
- LTXVEditAnythingApply: load the LoRA + extras and patch the model.
- LTXV Edit Anything Looping Sampler: sampler that injects role / AdaLN / visual cross-attention and handles long videos in chunks.
- LTXVEditAnythingModuleLoader: load the *.module.safetensors sidecar.
Status
Released as experimental research artefacts. Expect failures, do not deploy, and please report what works and what doesn't.
Credits
If you use these models (in a project, a demo, a paper, a video, a tweet, a workflow, anything), please credit my work. These checkpoints are the result of weeks of research, dataset building, and training runs, and that effort is what makes any of it usable. Crediting the source is the bare minimum that keeps open research like this sustainable.
Author: Alisson Pereira dos Anjos (@Alissonerdx)
Suggested attribution:
Edit Anything LoRAs by Alisson Pereira dos Anjos (huggingface.co/Alissonerdx/EditAnything).
Links back to this repository are appreciated wherever you publish results.