How to use Alissonerdx/EditAnything with Diffusers:

```
pip install -U diffusers transformers accelerate
```

```python
import torch
from diffusers import DiffusionPipeline

# switch to "mps" for apple devices
pipe = DiffusionPipeline.from_pretrained("Lightricks/LTX-2.3", dtype=torch.bfloat16, device_map="cuda")
pipe.load_lora_weights("Alissonerdx/EditAnything")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]
```

---
license: apache-2.0
library_name: diffusers
base_model: Lightricks/LTX-2.3
tags:
- lora
- video
- video-editing
- ltx-2.3
---

# Edit Anything: Experimental LTX-2 Video Editing LoRAs

> **Heads up.** These LoRAs are research experiments. They are far from
> production-ready and will fail on many inputs. They are released for the
> community to play with and break, not as a finished tool.

This repository hosts three unrelated training tracks built on top of
**LTX-2.3 (22B)** for video editing:

1. **Edit Anything v0.1: motion transfer LoRA** (two ranks).
2. **Edit Anything: no-reference multitask LoRA** (rank 256, prompt-driven only).
3. **Reference video-to-video (Ref V2V): experimental IC-LoRA + sidecar modules** (two builds).

Inference is meant to run through the **BFSnodes** ComfyUI custom nodes.
The Ref V2V build in particular needs them to load the sidecar modules and
install the custom branches into the transformer.

---

## 1. Edit Anything v0.1 (motion transfer)

Files:

- `edit_anything_30k_v0.1_motion_transfer_r128.safetensors`
- `edit_anything_30k_v0.1_motion_transfer_r256.safetensors`

### What it is

**v0.1 is not a direct continuation of v1.0.** It was trained from scratch
in two stages:

1. **Stage 1: image-only pretraining.** ~30 000 image edit pairs. Training
   a *video* model on still images is admittedly not ideal, but it was a way
   to push the editing vocabulary beyond what a small video-only dataset can
   teach.
2. **Stage 2: video fine-tune with `first_frame_conditioning > 0`.** This
   restored the temporal prior and unlocked the motion-transfer behaviour
   described below.

In theory v0.1 can do the same edits as v1.0, but **temporal consistency may
be weaker than v1.0** because so much of the training happened on still
images. Test against v1.0 case-by-case before assuming v0.1 wins on your task.

### Motion transfer

Because stage 2 included first-frame conditioning, you can drive the LoRA
into a motion-transfer mode:

1. Take a guide video.
2. **Replace its first frame** with an edited still (insert a new subject,
   swap an object, etc.). Use a strong image-editing model, **Flux Klein**
   or similar, to prepare it; the quality of this single frame propagates
   through the whole clip.
3. Feed the edited frame as the first frame of the input, and the original
   guide video as the motion source.

The model uses the new first frame as the appearance anchor and copies the
motion from the rest of the guide.

Limitations (these are real, not theoretical; expect them to bite):

- **Hard scene cuts break it.** The model assumes continuous motion from
  the first frame onwards. A cut to a different camera angle or location
  mid-clip will produce smearing, ghosting, or the inserted subject jumping
  to the wrong position. Use clips without cuts, or split at the cuts and
  process each segment separately.
- **Very fast motion fails.** Quick pans, fast subject movement, or
  high-velocity action confuse the motion-copy mechanism. Outputs degrade
  to blur or to the model "freezing" on the first-frame appearance and
  losing the motion entirely. Stick to moderate-speed clips.
- Poor blending / artefacts in the first frame propagate everywhere.
- Works best when the inserted subject roughly occupies the same region as
  whatever it replaces.

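
The advice to split at hard cuts can be roughed out as below. This is a minimal sketch, not part of BFSnodes: the function name `split_at_cuts`, the `(T, H, W, C)` frame layout with values in `[0, 1]`, and the `threshold` value are all assumptions made for illustration, and a plain frame-difference test will miss soft transitions such as fades.

```python
import numpy as np

def split_at_cuts(frames: np.ndarray, threshold: float = 0.25) -> list[tuple[int, int]]:
    """Return (start, end) index ranges, end exclusive, that contain no hard cut.

    frames: array shaped (T, H, W, C) with values in [0, 1] (assumed layout).
    threshold: hypothetical cutoff on the mean absolute difference between
    consecutive frames; tune it per clip.
    """
    if len(frames) == 0:
        return []
    # Mean absolute pixel change between each frame and the next one.
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    # A cut sits between frame i and i + 1 when the change exceeds the threshold.
    cut_starts = [i + 1 for i, d in enumerate(diffs) if d > threshold]
    bounds = [0, *cut_starts, len(frames)]
    return list(zip(bounds[:-1], bounds[1:]))
```

Each returned range can then be processed as its own clip, with its own edited first frame.
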
### Prompting

The prompt is just as critical as in v1.0. **Describe both the object being
replaced and the new one in detail**. Example: *"Replace the bronze statue on
the left with a tall man wearing a navy raincoat and brown boots."* Vague
prompts produce bad edits.

### Which rank to use

The same training produced both files. v0.1 is actually the merge of the
two-stage training (one LoRA per stage), re-extracted at two different ranks
via Frobenius-optimal truncated SVD:

| File | Rank | Size | Frobenius retention |
|---|---|---|---|
| `edit_anything_30k_v0.1_motion_transfer_r128.safetensors` | 128 | 1.31 GB | ~99.4% |
| `edit_anything_30k_v0.1_motion_transfer_r256.safetensors` | 256 | 2.62 GB | ~99.9% |

r256 is closer to the merged source. r128 is normally indistinguishable in
practice. Pick whichever fits your workflow.

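
For readers unfamiliar with the re-extraction step, here is a minimal sketch of a truncated SVD on a single merged weight delta. It is not the script used for these files. The function name `truncate_to_rank`, the square-root split of the singular values between the two factors, and the definition of "retention" as the kept share of the Frobenius norm are assumptions; the card does not state how its retention column was computed.

```python
import numpy as np

def truncate_to_rank(delta: np.ndarray, rank: int):
    """Best rank-`rank` approximation of `delta` in the Frobenius norm (Eckart-Young)."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    r = min(rank, len(s))
    # Split sqrt(singular values) across both factors (an assumed, common convention).
    root = np.sqrt(s[:r])
    up = u[:, :r] * root            # shape (out_features, r)
    down = root[:, None] * vt[:r]   # shape (r, in_features)
    # Assumed definition of retention: kept share of the Frobenius norm.
    retention = float(np.sqrt(np.sum(s[:r] ** 2) / np.sum(s ** 2)))
    return up, down, retention

rng = np.random.default_rng(0)
# Stand-in for a merged delta: the sum of two rank-8 LoRA products, one per stage.
delta = (rng.normal(size=(64, 8)) @ rng.normal(size=(8, 48))
         + rng.normal(size=(64, 8)) @ rng.normal(size=(8, 48)))
up, down, retention = truncate_to_rank(delta, rank=16)
```

In this toy case the merged delta has rank at most 16, so a rank-16 extraction reproduces it almost exactly; a lower rank trades retention for file size, which is the same trade the r128 and r256 files make.
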
### How to wire the LoopingSampler

This is a **standard LoRA**, not a sidecar. Load it through the regular
ComfyUI LoraLoader **before** the LoopingSampler. On the sampler itself:

- `editanything_module` → **leave disconnected**.
- `ref_image` → the edited first frame (for motion transfer) **or** the
  source frame you want preserved (for plain editing).
- `guide_frames` → the guide video.
- `enable_role_embedding`, `enable_adaln`, `enable_visual_crossattn` →
  all **off**. None of those branches were trained for v0.1; turning them
  on with no module connected does nothing anyway, but keeping them off
  silences the WARN logs.

---

## 2. Edit Anything: no-reference multitask LoRA

File:

- `edit_anything_v1.1_r256.safetensors`

### What it is

A **prompt-only** multitask editing LoRA. No reference image, no first-frame
conditioning: the model is driven entirely by the text prompt and the
guide video. Trained on a balanced mix of **Add, Remove, Replace, Style**
edits.

### What's different about it (vs v0.1)

The task vocabulary overlaps heavily with v0.1; both can do Add, Remove,
Replace, Change, Convert. What changes here:

- **Two-stage training continuation**: the first stage gave the model its
  edit vocabulary; the second stage refined it on a larger, more balanced
  video pair set covering Add / Remove / Replace / Style.
- **Rank 256** (vs v0.1's effective rank from the merge), giving more
  capacity for the broader task mix.
- Trained directly on video pairs, so the temporal behaviour on these
  tasks tends to be steadier than on a model whose first stage was on
  still images.

### How to use it

**Standalone**: load it as a regular LoRA on vanilla LTX-2.3 through any
ComfyUI LoRA loader. The file already carries everything it needs; no
stacking with v0.1, no companion module.

### Limitations

- No reference image, so identity is not anchored: Add / Replace of a
  specific person or object will be wobblier than the Ref V2V build.
- No motion transfer (that's v0.1 only).

### Prompting

Same imperative shape as v0.1, but the training set splits into four very
distinct caption styles. Match the one that fits the edit you want; the
distribution is narrow and the model expects the right shape.

The training set is roughly balanced across **Add, Remove, Replace and
Style** buckets, with Style being the smallest of the four. Captions
below are real examples drawn from those buckets.

#### Add: 15 to 30+ words, describe what to add and where

* `Add a smiling woman with brown hair, wearing a pink sleeveless top, sitting to the right of the man at the news desk.`
* `Add a person wearing a blue denim shirt over a white t-shirt to the right side of the frame, behind the person cooking.`
* `Add a decorated Christmas tree with red and white ornaments and lights to the right of the man.`
* `Add a blonde boy wearing a black t-shirt with a blue collar and blue patterned pants, sitting behind the other children in the upper center of the frame.`
* `Add two horizontal wooden strips to the front of the white range hood.`

Pattern: `Add <detailed subject description>, <position in frame>, <surrounding context>.`

#### Remove: very short, 4 to 10 words

* `Remove the man drinking from a glass.`
* `Remove the disco ball.`
* `Remove the large tree on the right.`
* `Remove the squirrel in the foreground.`
* `Remove the man on the left.`

Pattern: `Remove the <object>` (+ optional position). Resist the urge to
over-describe; long Remove prompts drift outside the training shape and
often fail.

#### Replace: 20 to 35 words, describe both old and new

* `Replace the white panel door on the right side of the frame with a dark brown grandfather clock.`
* `Replace the light-colored cat lying on the mat on the floor with a young woman sitting on the mat.`
* `Replace the dark grey knitted sweater on the man's torso with a black and white patterned Christmas sweater.`
* `Replace the blue robot with a glowing blue face on the left with a smiling man wearing sunglasses and a blue shirt.`
* `Replace the sitting person wearing a black cape on the left with a black fabric draped over an object.`

Pattern: `Replace <description of the original subject and its location> with <description of the new subject>.`

#### Style: fixed template, the style name is what changes

* `Convert the video into a Pencil Sketch style.`
* `Convert the video into a Watercolor Painting style.`
* `Convert the video into a Van Gogh style.`
* `Convert the video into a Play-Doh style.`
* `Convert the video into a Claymation style.`
* `Convert the video into a 3D Chibi style.`
* `Convert the video into a Ghibli style.`
* `Convert the video into a Pop Art style.`
* `Convert the video into an American Cartoon style.`
* `Convert the video into a Flat Vector Cartoon style.`

The training set covers **300+ distinct style names**. Many work; many do
not. The styles listed above are heavily represented in training. Use the
exact phrase `Convert the video into a <STYLE> style`; deviations from this
template degrade quality noticeably.

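
The four caption patterns above can be captured as small string builders. This is only an illustrative sketch: the function names are made up here, and the `a` / `an` switch in `style_prompt` is an assumption inferred from the `an American Cartoon style` example rather than something the training notes spell out.

```python
def add_prompt(subject: str, position: str, context: str = "") -> str:
    """Add <detailed subject description>, <position in frame>, <surrounding context>."""
    parts = [f"Add {subject}", position]
    if context:
        parts.append(context)
    return ", ".join(parts) + "."

def remove_prompt(obj: str, position: str = "") -> str:
    """Remove the <object> (+ optional position). Keep it short."""
    tail = f" {position}" if position else ""
    return f"Remove the {obj}{tail}."

def replace_prompt(original: str, new: str) -> str:
    """Replace <original subject and its location> with <new subject>."""
    return f"Replace {original} with {new}."

def style_prompt(style: str) -> str:
    """Convert the video into a <STYLE> style (fixed template)."""
    style = style.strip()
    if not style:
        raise ValueError("style name must not be empty")
    # Assumption: follow the a/an usage seen in the training examples.
    article = "an" if style[0].lower() in "aeiou" else "a"
    return f"Convert the video into {article} {style} style."
```

For example, `remove_prompt("large tree", "on the right")` rebuilds one of the Remove captions listed above word for word.
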
#### What it does *not* do

These are honest limits of the training distribution; don't expect them
to work just because the model is multitask:

- **No compositional prompts.** *"Add X and remove Y"*, *"Replace A with B
  and add C"*, etc. are **not** in the training set. Captions combining
  two action verbs are essentially absent (the only ones present are the
  "Remove X and replace with Y" idiom, which is really a single Replace).
  Pure multi-action edits will fall apart; split them into separate runs.
- **No "change background" as a task.** Background is only used as a
  *positional reference* ("in the background", "on the wall in the
  background"). To swap the entire backdrop, phrase it as a **Replace**
  on a concrete background element, e.g.
  *"Replace the brick wall in the background with a forest at sunset"*.
  Vague prompts like *"Change the background to a beach"* are
  off-distribution and rarely work.
- **No global colour grade / lighting change.** Only the Style template
  is trained as a global transform. Anything else global (LUT-style
  adjustments, time-of-day swaps without a concrete object) is unreliable.

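
A rough pre-flight check for the first two limits might look like the sketch below. It is a keyword heuristic only, with hypothetical names (`ACTION_VERBS`, `action_verbs`, `is_single_action`); it assumes the four trained verbs are the ones used in the caption patterns of this section, and it will misjudge prompts where one of those words appears as an ordinary noun or adjective.

```python
import re

# Assumed verb list, taken from the Add / Remove / Replace / Style templates.
ACTION_VERBS = ("add", "remove", "replace", "convert")

def action_verbs(prompt: str) -> list[str]:
    """List the trained action verbs that appear in the prompt, in order."""
    words = re.findall(r"[a-z]+", prompt.lower())
    return [w for w in words if w in ACTION_VERBS]

def is_single_action(prompt: str) -> bool:
    """True when the prompt looks like exactly one trained edit."""
    verbs = action_verbs(prompt)
    # "Remove X and replace with Y" is the one two-verb idiom noted as present.
    if verbs == ["remove", "replace"] and re.search(r"\band replace with\b", prompt.lower()):
        return True
    return len(verbs) == 1
```

A prompt that fails the check is a candidate for splitting into separate runs, or for rephrasing as a Replace on a concrete element.
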
### Which LoRA should I use?

| If you want… | Use |
|---|---|
| Motion transfer (edit first frame externally, model copies motion) | **v0.1 motion transfer** |
| Multi-task edits (add / remove / replace / style) driven only by prompt | **no-ref multitask r256** (standalone) |
| Strong identity transfer from a reference image (Add / Replace) | **Ref V2V** |

### How to wire the LoopingSampler

A single **standard LoRA**, no sidecar, no stacking. Load through one
ComfyUI LoraLoader before the LoopingSampler. On the sampler:

- `editanything_module` → **leave disconnected**.
- `ref_image` → **leave disconnected**. This LoRA has no reference-image
  path; passing one will just pre-encode tokens that you do not want.
- `guide_frames` → the guide video.
- `enable_role_embedding`, `enable_adaln`, `enable_visual_crossattn` →
  all **off** (no module = nothing to inject anyway).

---

## 3. Reference video-to-video (Ref V2V): experimental

Files (two builds of the same LoRA family; each ships as a `(.standard, .module)` pair):

- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors`
- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors`
- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors`
- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors`

### What it is

The goal is **add / replace using a reference image**: same vibe as Edit
Anything v1.0, but with an explicit image as the appearance source instead
of relying only on the prompt.

Trained on **~1600** Add / Replace video pairs. Reference-paired video
datasets are basically nonexistent, so the dataset had to be built from
scratch, which is why the sample count is small. **It often fails.** This
is fully experimental; thousands of training runs went into landing on this
LoRA layout, and it is still unclear how much it actually helps.

### Architecture: why this LoRA has "modules"

Trained as a conventional IC-LoRA, plus extra projection branches that try
to make the reference signal survive across layers:

- **`ref_visual_proj`**: projects the reference VAE latent into 32 visual
  memory tokens.
- **`ref_attn`**: a dedicated cross-attention branch inside each
  transformer block, reading those tokens.
- **`ref_adaln_proj`**: a global AdaLN bias derived from the reference
  (palette / overall look).
- **`role_embedding`**: an experimental token bias inspired by some of
  Kijai's tests; whether it actually helps is still unclear.

These extra weights are saved alongside the LoRA in a `.module.safetensors`
sidecar because they are **not standard LoRA adapters**. The regular
ComfyUI LoRA loader can't consume them, so they need a dedicated node.

### How to load

| File | What it is | Where it goes |
|---|---|---|
| `*.standard.safetensors` | LoRA on `attn1` / `attn2` / `ff` only | Standard ComfyUI LoRA loader |
| `*.module.safetensors` | `role_embedding`, `ref_adaln_proj`, `ref_visual_proj`, `ref_attn` LoRA adapters | `LTXVEditAnythingModuleLoader` (BFSnodes) |

Both files of a pair must be loaded **together**: the LoRA was trained
against the sidecar adapters and they only make sense as a unit. Do not mix
`.standard` from one build with `.module` from another.

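
Because the two files of a build share a name stem, a mismatched pair can be caught before anything is loaded. The helpers below are a hypothetical convenience (`module_for` and `is_matching_pair` are names invented for this sketch, not BFSnodes functions) that rely only on the `.standard.safetensors` / `.module.safetensors` naming shown in the file list.

```python
from pathlib import Path

STANDARD_SUFFIX = ".standard.safetensors"
MODULE_SUFFIX = ".module.safetensors"

def module_for(standard_path: str) -> Path:
    """Return the sidecar path that belongs to a `.standard.safetensors` file."""
    p = Path(standard_path)
    if not p.name.endswith(STANDARD_SUFFIX):
        raise ValueError(f"not a .standard file: {p.name}")
    return p.with_name(p.name[: -len(STANDARD_SUFFIX)] + MODULE_SUFFIX)

def is_matching_pair(standard_path: str, module_path: str) -> bool:
    """True when both files come from the same build (same name stem)."""
    return module_for(standard_path).name == Path(module_path).name
```

Running `is_matching_pair` on the paths you are about to wire into the two loaders is a cheap guard against combining the 2-extras `.standard` file with the 4-extras `.module` file.
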
The module file is consumed by the **`🅛🅣🅧 LTXV Edit Anything Looping
Sampler`** node, which was written specifically to:

1. Install the `ref_attn` cross-attention branch on every transformer block.
2. Inject the AdaLN / role / visual cross-attention conditioning at the
   correct points in the model.
3. Sample long videos in overlapping chunks with the conditioning re-applied
   per chunk.

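
To make "overlapping chunks" concrete, the sketch below computes frame windows for a long clip. It illustrates the general idea only: the function name `chunk_windows` and its `chunk` and `overlap` parameters are hypothetical, and the chunk length, overlap, and blending that the sampler node actually uses are not documented here.

```python
def chunk_windows(num_frames: int, chunk: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) windows, end exclusive, covering num_frames with overlap."""
    if not 0 <= overlap < chunk:
        raise ValueError("overlap must be >= 0 and smaller than chunk")
    if num_frames <= 0:
        return []
    windows = []
    start = 0
    while True:
        end = min(start + chunk, num_frames)
        windows.append((start, end))
        if end >= num_frames:
            break
        # The next window re-uses the last `overlap` frames of this one.
        start = end - overlap
    return windows
```

In a scheme like this, the reference conditioning would be applied again for every window, which matches the "re-applied per chunk" behaviour described in step 3.
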
### How to wire the LoopingSampler

- Load the `*.standard.safetensors` through a normal ComfyUI LoraLoader
  before the sampler.
- Load the `*.module.safetensors` through `LTXVEditAnythingModuleLoader`
  and connect its `editanything_module` output to the sampler.
- On the sampler:
  - `editanything_module` → **the module loader output** (required).
  - `ref_image` → **the reference image** (required; this is what
    `Add` / `Replace` will insert).
  - `guide_frames` → the source video to edit.
  - `enable_adaln` → **on** (defaults match training).
  - `enable_visual_crossattn` → **on** for the 4-extras build; off for the
    2-extras build (where it is a no-op anyway).
  - `enable_role_embedding` → **off** for the 4-extras build (training
    config disabled it). On if you're loading the 2-extras build alone.

A missing `ref_image` does not raise an error here: it disables AdaLN and
the visual cross-attention, and the sampler only logs a warning.

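
The wiring rules from the three "How to wire the LoopingSampler" lists can be collected into one lookup table. This is a hypothetical summary for quick reference, not a BFSnodes data structure: the build labels and the `wiring_problems` helper are invented for this sketch, while the flag names and values are copied from the lists in this card.

```python
# Hypothetical summary table; True means "connect" or "on", False means
# "leave disconnected" or "off".
SAMPLER_WIRING = {
    "v0.1 motion transfer": {
        "editanything_module": False, "ref_image": True,
        "enable_role_embedding": False, "enable_adaln": False,
        "enable_visual_crossattn": False,
    },
    "no-ref multitask": {
        "editanything_module": False, "ref_image": False,
        "enable_role_embedding": False, "enable_adaln": False,
        "enable_visual_crossattn": False,
    },
    "Ref V2V 2-extras": {
        "editanything_module": True, "ref_image": True,
        "enable_role_embedding": True, "enable_adaln": True,
        "enable_visual_crossattn": False,
    },
    "Ref V2V 4-extras": {
        "editanything_module": True, "ref_image": True,
        "enable_role_embedding": False, "enable_adaln": True,
        "enable_visual_crossattn": True,
    },
}

def wiring_problems(build: str, has_module: bool, has_ref_image: bool) -> list[str]:
    """Compare the connected inputs against the table and list any mismatches."""
    expected = SAMPLER_WIRING[build]
    problems = []
    if has_module != expected["editanything_module"]:
        problems.append("editanything_module should be "
                        + ("connected" if expected["editanything_module"] else "disconnected"))
    if has_ref_image != expected["ref_image"]:
        problems.append("ref_image should be "
                        + ("connected" if expected["ref_image"] else "disconnected"))
    return problems
```

For `guide_frames` the answer is the same in every row (the guide or source video), so it is left out of the table.
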
### Which build to use

- **`ref_adaln_proj-role_embedding`**: the original training. Only ships
  the two side-channel modules.
- **`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`**: the
  continuation. Adds the visual cross-attention branch and its projector on
  top.

It is genuinely **not clear yet** whether the extra branches help over the
plain LoRA. Both builds are honest experiments. Try both, decide for your
own use case, and please share findings.

### Reading the layers

For anyone who wants to understand what each layer in the Ref V2V
checkpoint does:

- [`lora_layers_reference.md`](./lora_layers_reference.md): full tensor
  inventory of both builds.
- [`lora_layers_impact.md`](./lora_layers_impact.md): what each branch
  contributes at inference and which inference knob (`adaln_scale`,
  `ref_context_scale`, `ref_token_scale`, `ref_start_block`,
  `ref_end_block`, etc.) maps back to which training default.

---

## Prompt examples

v0.1 and Ref V2V were trained on very different caption styles. Match the
style of whichever LoRA you're using; straying outside the training
distribution is the fastest way to get garbage out.

### Edit Anything v0.1: standard editing

The stage-1 dataset uses short imperative captions describing one or two
edits. Use the same shape at inference. Examples drawn from the training
distribution:

- *"Replace the stone statue of a man on the left with a young woman in a
  green dress."*
- *"Add a black labrador retriever sitting beside the woman on the bench."*
- *"Remove the teacher from the classroom."*
- *"Alter the cap's colour from modern black to deep maroon."*
- *"Replace the fresh citrus-green background with a wooden desk."*
- *"Add faint tire tracks across the snow behind the car."*
- *"Add a black statue, a blue camera, a cyan towel, a red guitar and a
  pink backpack to the lakeside pier."*

Tips:

- Imperative verbs: **Add / Replace / Remove / Alter / Change**.
- When replacing, **describe both** the original and the new subject so the
  model can localise the edit.
- Keep captions short and concrete. Long flowery prose hurts.

### Edit Anything v0.1: motion transfer

Workflow:

1. Pick a guide video.
2. Edit **only the first frame** externally (Flux Klein or any
   capable image-edit model) to introduce the new subject in the desired
   pose and position.
3. Feed the edited frame as the first frame of the input and the original
   guide as motion source.
4. The prompt should describe **the inserted subject and the action being
   preserved**.

Examples:

- *"Replace the standing man holding the umbrella with a woman in a red
  coat holding the same umbrella, walking across the puddles."*
- *"Add a tabby cat curled up in the armchair while the man in the
  background keeps reading."*
- *"Replace the runner in the blue jersey with a man wearing a white shirt
  and grey shorts running along the same path."*

Limits: fast or chaotic motion will fail; the inserted subject should
occupy roughly the same region/scale as what it replaces.

### Reference V2V (Ref V2V): Add and Replace

These captions are real samples from the ~1600-pair training set. They
describe the **target scene after the edit** in detail. The reference
image carries the *appearance* of the inserted subject; the caption
carries *position, pose, action, and surrounding context*.

**Add task** (the reference image holds the new subject):

- *"Add a middle-aged man with curly grey hair, a beard and glasses,
  wearing a blue quarter-zip sweater, on the right side of the frame,
  standing in front of a raw cut of meat on a tray."*
- *"Add a light-coloured small boat with dark seats and an outboard motor
  floating in the water."*
- *"Add an open book filled with colourful pencils in the woman's hands."*
- *"Add a silver metallic bucket on the table in front of the blonde
  character, with her hands stirring a mixture inside."*
- *"Add two miniature dolls, one blonde and one brunette, dressed in
  patterned clothing, sitting at a small table with teacups and small
  white vases on the countertop."*

**Replace task** (the reference image holds the new subject; the caption
also describes what is being replaced):

- *"Replace the standing kangaroo holding the bicycle handlebars with a
  man wearing a white t-shirt, light brown shorts and a yellow cap,
  holding the bicycle handlebars."*
- *"Replace the stone statue of a man on the left side with a young woman
  in a green dress."*
- *"Replace the wooden barrel near the entrance with a large brown leather
  suitcase."*

Tips for Ref V2V:

- **Describe the inserted subject in full**, even though the reference
  image is the source of truth; the text path drives placement and pose.
- For *Replace*, **also describe what is being replaced** so the model can
  match the spatial region.
- Keep the inserted subject roughly in the same scale and region as what
  it replaces.
- The captions in the training set average ~25-40 words; aim for that
  range. Bare captions like *"Add a man"* are far too sparse and will fail.

---

## Inference tips (apply to all models)

**CFG matters a lot here.** The default workflow runs with the LTX-2.3
**distilled / acceleration LoRAs** for fast 4-8 step sampling, which
locks **CFG = 1.0**. That's fine for casual runs, but at CFG 1 the model
follows the prompt loosely: you get the reference image to "show up"
but the edit instruction itself is only weakly enforced.

**For harder prompts, raise CFG above 1.0.** This means dropping (or
weakening) the distilled / acceleration LoRAs and going back to a normal
sampler with more steps. The trade-off: sampling is significantly slower,
but the model follows the prompt much more closely.

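
For context on why CFG 1 enforces the prompt only weakly, the sketch below shows the standard classifier-free guidance combination. It assumes the sampler uses this textbook form; `apply_cfg`, `cond`, and `uncond` are illustrative names, with `cond` and `uncond` standing in for the model's predictions with and without the prompt.

```python
import numpy as np

def apply_cfg(cond: np.ndarray, uncond: np.ndarray, cfg: float) -> np.ndarray:
    """Textbook classifier-free guidance: extrapolate away from the unconditional prediction."""
    return uncond + cfg * (cond - uncond)

rng = np.random.default_rng(0)
cond = rng.normal(size=(4, 4))    # stand-in for the prompt-conditioned prediction
uncond = rng.normal(size=(4, 4))  # stand-in for the unconditional prediction

at_one = apply_cfg(cond, uncond, 1.0)    # collapses to the conditional prediction
stronger = apply_cfg(cond, uncond, 4.0)  # pushes further along (cond - uncond)
```

At `cfg = 1.0` the unconditional term cancels out, so nothing amplifies the prompt's influence; higher values scale the difference between the two predictions, which is also where over-saturation at very high CFG comes from.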
Other knobs:

- If the model is **ignoring the prompt** (edit isn't being applied, the
  reference is barely showing up, the style transfer is faint), raising
  CFG is the single most common fix. Go up to 6-8 if needed.
- If results look **over-saturated, plasticky, or motion is freezing**,
  CFG is too high: pull back toward 3-4 or re-enable the distilled LoRA
  for CFG 1 if you don't actually need stronger prompt adherence.
- Ref V2V in particular benefits from being more aggressive with CFG when
  the reference identity isn't transferring cleanly.
- Combine CFG tuning with the LoRA-specific knobs from each section
  (`adaln_scale`, `ref_context_scale`, `ref_token_scale` for Ref V2V;
  prompt rewriting for v0.1 / no-ref).

Treat CFG as a real knob, not a constant, and be ready to give up some
speed when you actually need the edit to land.

---

## ComfyUI nodes

All recommended inference paths run through the **BFSnodes** custom node
set. For now BFSnodes is the only place these nodes live; once they
stabilise they may move elsewhere.

Specific nodes used by these LoRAs:

- `LTXVEditAnythingApply`: load the LoRA + extras and patch the model.
- `🅛🅣🅧 LTXV Edit Anything Looping Sampler`: sampler that injects role /
  AdaLN / visual cross-attention and handles long videos in chunks.
- `LTXVEditAnythingModuleLoader`: load the `*.module.safetensors` sidecar.

---

## Status

Released as experimental research artefacts. Expect failures, do not
deploy, and please report what works and what doesn't.

---

## Credits

If you use these models (in a project, a demo, a paper, a video, a tweet,
a workflow, anything), **please credit my work**. These checkpoints are the
result of weeks of research, dataset building, and training runs, and that
effort is what makes any of it usable. Crediting the source is the bare
minimum that keeps open research like this sustainable.

**Author:** Alisson Pereira dos Anjos ([@Alissonerdx](https://huggingface.co/Alissonerdx))

Suggested attribution:

> Edit Anything LoRAs by Alisson Pereira dos Anjos
> ([huggingface.co/Alissonerdx/EditAnything](https://huggingface.co/Alissonerdx/EditAnything)).

Links back to this repository are appreciated wherever you publish results.