	---
	license: apache-2.0
	library_name: diffusers
	base_model: Lightricks/LTX-2.3
	tags:
	- lora
	- video
	- video-editing
	- ltx-2.3
	---

	# Edit Anything — Experimental LTX-2 Video Editing LoRAs

	> Heads up. These LoRAs are research experiments. They are far from
	> production-ready and will fail on many inputs. They are released for the
	> community to play with and break, not as a finished tool.

	This repository hosts three unrelated training tracks built on top of
	LTX-2.3 (22B) for video editing:

	1. Edit Anything v0.1 — motion transfer LoRA (two ranks).
	2. Edit Anything — no-reference multitask LoRA (rank 256, prompt-driven only).
	3. Reference video-to-video (Ref V2V) — experimental IC-LoRA + sidecar modules (two builds).

	Inference is meant to run through the BFSnodes ComfyUI custom nodes —
	the Ref V2V build in particular needs them to load the sidecar modules and
	install the custom branches into the transformer.

	---

	## 1. Edit Anything v0.1 (motion transfer)

	Files:

	- `edit_anything_30k_v0.1_motion_transfer_r128.safetensors`
	- `edit_anything_30k_v0.1_motion_transfer_r256.safetensors`

	### What it is

	v0.1 is not a direct continuation of v1.0. It was trained from scratch
	in two stages:

	1. Stage 1 — image-only pretraining. ~30 000 image edit pairs. Training
	a video model on still images is admittedly not ideal, but it was a way
	to push the editing vocabulary beyond what a small video-only dataset can
	teach.
	2. Stage 2 — video fine-tune with `first_frame_conditioning > 0`. This
	restored the temporal prior and unlocked the motion-transfer behaviour
	described below.

	In theory v0.1 can do the same edits as v1.0, but **temporal consistency may
	be weaker than v1.0** because so much of stage 1 happened on still images.
	Test against v1.0 case-by-case before assuming v0.1 wins on your task.

	### Motion transfer

	Because stage 2 included first-frame conditioning, you can drive the LoRA
	into a motion-transfer mode:

	1. Take a guide video.
	2. Replace its first frame with an edited still (insert a new subject,
	swap an object, etc.). Use a strong image-editing model — Flux Klein
	or similar — to prepare it; the quality of this single frame propagates
	through the whole clip.
	3. Feed the edited frame as the first frame of the input, and the original
	guide video as the motion source.

	The model uses the new first frame as the appearance anchor and copies the
	motion from the rest of the guide.

	Limitations (these are real, not theoretical — expect them to bite):

	- Hard scene cuts break it. The model assumes continuous motion from
	the first frame onwards. A cut to a different camera angle or location
	mid-clip will produce smearing, ghosting, or the inserted subject jumping
	to the wrong position. Use clips without cuts, or split at the cuts and
	process each segment separately.
	- Very fast motion fails. Quick pans, fast subject movement, or
	high-velocity action confuse the motion-copy mechanism. Outputs degrade
	to blur or to the model "freezing" on the first-frame appearance and
	losing the motion entirely. Stick to moderate-speed clips.
	- Poor blending / artefacts in the first frame propagate everywhere.
	- Works best when the inserted subject roughly occupies the same region as
	whatever it replaces.
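
One way to follow the "split at the cuts" advice is to look for abrupt frame-to-frame changes before running the LoRA. The sketch below is a simple heuristic, not part of BFSnodes: the function name `split_at_cuts` and the threshold value are assumptions, and frames are assumed to be a float array in `[0, 1]`.

```python
import numpy as np

def split_at_cuts(frames, threshold=0.25):
    """Split a clip into segments at abrupt frame-to-frame changes.

    frames: float array (T, H, W, C) with values in [0, 1].
    threshold: mean absolute difference treated as a hard cut (assumed value).
    Returns a list of (start, end) index pairs, end exclusive.
    """
    # Mean absolute difference between each frame and the previous one.
    diffs = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2, 3))
    cut_starts = [i + 1 for i, d in enumerate(diffs) if d > threshold]
    bounds = [0] + cut_starts + [len(frames)]
    return list(zip(bounds[:-1], bounds[1:]))

# Toy clip: 5 dark frames followed by 4 bright frames (one hard cut).
clip = np.concatenate([np.full((5, 8, 8, 3), 0.1), np.full((4, 8, 8, 3), 0.9)])
print(split_at_cuts(clip))
```

Each returned segment can then be processed as its own clip, with its own edited first frame. The threshold will need tuning per source; fades and fast motion can confuse a difference-based detector.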

	### Prompting

	The prompt is just as critical as in v1.0. **Describe both the object being
	replaced and the new one in detail.** Example: *"Replace the bronze statue on
	the left with a tall man wearing a navy raincoat and brown boots."* Vague
	prompts produce bad edits.

	### Which rank to use

	The same training produced both files. v0.1 is actually the merge of the
	two-stage training (one LoRA per stage), re-extracted at two different ranks
	via Frobenius-optimal truncated SVD:

	| File | Rank | Size | Frobenius retention |
	|---|---|---|---|
	| `edit_anything_30k_v0.1_motion_transfer_r128.safetensors` | 128 | 1.31 GB | ~99.4% |
	| `edit_anything_30k_v0.1_motion_transfer_r256.safetensors` | 256 | 2.62 GB | ~99.9% |

	r256 is closer to the merged source. r128 is normally indistinguishable in
	practice. Pick whichever fits your workflow.
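
To make "re-extracted via truncated SVD" concrete, here is a minimal sketch of the idea on a single toy layer. It is not the actual extraction script: the matrix sizes, the random data, and the definition of retention as a ratio of Frobenius norms are all assumptions for illustration.

```python
import numpy as np

def truncate_to_rank(delta_w, rank):
    """Best rank-`rank` approximation of delta_w in Frobenius norm (Eckart-Young)."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    u_r, s_r, vt_r = u[:, :rank], s[:rank], vt[:rank, :]
    # Split the singular values evenly between the two LoRA factors.
    lora_up = u_r * np.sqrt(s_r)                # (out, rank)
    lora_down = np.sqrt(s_r)[:, None] * vt_r    # (rank, in)
    # Assumed definition: retained Frobenius norm / total Frobenius norm.
    retention = np.sqrt(np.sum(s_r ** 2) / np.sum(s ** 2))
    return lora_up, lora_down, retention

rng = np.random.default_rng(0)
# Two hypothetical per-stage LoRA updates for one 64x48 layer (rank 8 each).
stage1 = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 48))
stage2 = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 48))
merged = stage1 + stage2  # rank at most 16

for rank in (16, 8):
    up, down, kept = truncate_to_rank(merged, rank)
    print(f"rank {rank}: retention {kept:.4f}")
```

Truncating at the full rank of the merged update loses nothing; a lower rank drops the smallest singular directions first, which is why the r128 file can stay close to the merged source at half the size.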

	### How to wire the LoopingSampler

	This is a standard LoRA, not a sidecar. Load it through the regular
	ComfyUI LoraLoader before the LoopingSampler. On the sampler itself:

	- `editanything_module` → leave disconnected.
	- `ref_image` → the edited first frame (for motion transfer) or the
	source frame you want preserved (for plain editing).
	- `guide_frames` → the guide video.
	- `enable_role_embedding`, `enable_adaln`, `enable_visual_crossattn` →
	all off. None of those branches were trained for v0.1; turning them
	on with no module connected does nothing anyway, but keeping them off
	silences the WARN logs.

	---

	## 2. Edit Anything — no-reference multitask LoRA

	File:

	- `edit_anything_v1.1_r256.safetensors`

	### What it is

	A prompt-only multitask editing LoRA. No reference image, no first-frame
	conditioning — the model is driven entirely by the text prompt and the
	guide video. Trained on a balanced mix of Add, Remove, Replace, and Style
	edits.

	### What's different about it (vs v0.1)

	The task vocabulary overlaps heavily with v0.1 — both can do Add, Remove,
	Replace, Change, Convert. What changes here:

	- Two-stage training continuation: the first stage gave the model its
	edit vocabulary; the second stage refined it on a larger, more balanced
	video pair set covering Add / Remove / Replace / Style.
	- Rank 256 (vs v0.1's effective rank from the merge), giving more
	capacity for the broader task mix.
	- Trained directly on video pairs, so the temporal behaviour on these
	tasks tends to be steadier than on a model whose first stage was on
	still images.

	### How to use it

	Standalone — load it as a regular LoRA on vanilla LTX-2.3 through any
	ComfyUI LoRA loader. The file already carries everything it needs; no
	stacking with v0.1, no companion module.

	### Limitations

	- No reference image → identity is not anchored, so Add / Replace of a
	specific person or object will be wobblier than the Ref V2V build.
	- No motion transfer (that's v0.1 only).

	### Prompting

	Same imperative shape as v0.1, but the training set is split into four very
	distinct caption styles. Match the one that fits the edit you want: the
	distribution is narrow and the model expects the right shape.

	The training set is roughly balanced across **Add, Remove, Replace and
	Style** buckets, with Style being the smallest of the four. Captions
	below are real examples drawn from those buckets.

	#### Add — 15 to 30+ words, describe what to add and where

	* `Add a smiling woman with brown hair, wearing a pink sleeveless top, sitting to the right of the man at the news desk.`
	* `Add a person wearing a blue denim shirt over a white t-shirt to the right side of the frame, behind the person cooking.`
	* `Add a decorated Christmas tree with red and white ornaments and lights to the right of the man.`
	* `Add a blonde boy wearing a black t-shirt with a blue collar and blue patterned pants, sitting behind the other children in the upper center of the frame.`
	* `Add two horizontal wooden strips to the front of the white range hood.`

	Pattern: `Add <detailed subject description>, <position in frame>, <surrounding context>.`

	#### Remove — very short, 4 to 10 words

	* `Remove the man drinking from a glass.`
	* `Remove the disco ball.`
	* `Remove the large tree on the right.`
	* `Remove the squirrel in the foreground.`
	* `Remove the man on the left.`

	Pattern: `Remove the <object>` (+ optional position). Resist the urge to
	over-describe — long Remove prompts drift outside the training shape and
	often fail.

	#### Replace — 20 to 35 words, describe both old and new

	* `Replace the white panel door on the right side of the frame with a dark brown grandfather clock.`
	* `Replace the light-colored cat lying on the mat on the floor with a young woman sitting on the mat.`
	* `Replace the dark grey knitted sweater on the man's torso with a black and white patterned Christmas sweater.`
	* `Replace the blue robot with a glowing blue face on the left with a smiling man wearing sunglasses and a blue shirt.`
	* `Replace the sitting person wearing a black cape on the left with a black fabric draped over an object.`

	Pattern: `Replace <description of the original subject and its location> with <description of the new subject>.`

	#### Style — fixed template, the style name is what changes

	* `Convert the video into a Pencil Sketch style.`
	* `Convert the video into a Watercolor Painting style.`
	* `Convert the video into a Van Gogh style.`
	* `Convert the video into a Play-Doh style.`
	* `Convert the video into a Claymation style.`
	* `Convert the video into a 3D Chibi style.`
	* `Convert the video into a Ghibli style.`
	* `Convert the video into a Pop Art style.`
	* `Convert the video into an American Cartoon style.`
	* `Convert the video into a Flat Vector Cartoon style.`

	The training set covers 300+ distinct style names. Many work; many do
	not. The list above is heavily represented in training. Use the exact
	phrase `Convert the video into a <STYLE> style` — deviations from this
	template degrade quality noticeably.
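
If prompts are generated programmatically, small helpers can keep them inside the four shapes described above. This is a sketch: the function names are hypothetical, and the `a` / `an` handling in `style_prompt` is inferred from the "an American Cartoon style" example.

```python
def add_prompt(subject, position, context=""):
    """Add <detailed subject description>, <position in frame>, <surrounding context>."""
    parts = [f"Add {subject}", position]
    if context:
        parts.append(context)
    return ", ".join(parts) + "."

def remove_prompt(obj, position=""):
    """Remove the <object>, plus an optional position."""
    suffix = f" {position}" if position else ""
    return f"Remove the {obj}{suffix}."

def replace_prompt(original, new):
    """Replace <original subject and its location> with <new subject>."""
    return f"Replace {original} with {new}."

def style_prompt(style):
    """Convert the video into a <STYLE> style ('an' before a vowel)."""
    article = "an" if style[0].lower() in "aeiou" else "a"
    return f"Convert the video into {article} {style} style."

print(remove_prompt("large tree", "on the right"))
print(style_prompt("American Cartoon"))
```

The helpers only enforce the sentence shape; the level of detail in each description is still up to you.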

	#### What it does not do

	These are honest limits of the training distribution — don't expect them
	to work just because the model is multitask:

	- No compositional prompts. *"Add X and remove Y"*, *"Replace A with B
	and add C"*, etc. are **not** in the training set. Captions combining
	two action verbs are essentially absent (the only ones present are the
	"Remove X and replace with Y" idiom, which is really a single Replace).
	Pure multi-action edits will fall apart — split them into separate runs.
	- No "change background" as a task. Background is only used as a
	positional reference ("in the background", "on the wall in the
	background"). To swap the entire backdrop, phrase it as a Replace
	on a concrete background element, e.g.
	"Replace the brick wall in the background with a forest at sunset".
	Vague prompts like "Change the background to a beach" are off-
	distribution and rarely work.
	- No global colour grade / lighting change. Only the Style template
	is trained as a global transform. Anything else global (LUT-style
	adjustments, time-of-day swaps without a concrete object) is unreliable.
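
A rough pre-flight check can catch compositional prompts before a run is wasted on them. The sketch below is a naive keyword counter, not a parser: the function names are hypothetical, and an action verb that appears inside a description would be miscounted.

```python
import re

ACTION_VERBS = ("add", "remove", "replace", "convert")
VERB_PATTERN = re.compile(r"\b(" + "|".join(ACTION_VERBS) + r")\b")

def count_actions(prompt):
    """Count action verbs, treating 'Remove X and replace with Y' as one Replace."""
    text = prompt.lower()
    verbs = VERB_PATTERN.findall(text)
    if verbs == ["remove", "replace"] and re.search(r"\band replace\b.*\bwith\b", text):
        return 1
    return len(verbs)

def is_single_action(prompt):
    """True when the prompt contains exactly one edit action."""
    return count_actions(prompt) == 1

for prompt in ("Remove the disco ball.", "Add a dog and remove the cat."):
    print(prompt, "->", "ok" if is_single_action(prompt) else "split into separate runs")
```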

	### Which LoRA should I use?

	| If you want… | Use |
	|---|---|
	| Motion transfer (edit first frame externally, model copies motion) | v0.1 motion transfer |
	| Multi-task edits (add / remove / replace / style) driven only by prompt | no-ref multitask r256 (standalone) |
	| Strong identity transfer from a reference image (Add / Replace) | Ref V2V |

	### How to wire the LoopingSampler

	A single standard LoRA, no sidecar, no stacking. Load through one
	ComfyUI LoraLoader before the LoopingSampler. On the sampler:

	- `editanything_module` → leave disconnected.
	- `ref_image` → leave disconnected. This LoRA has no reference-image
	path; passing one will just pre-encode tokens that you do not want.
	- `guide_frames` → the guide video.
	- `enable_role_embedding`, `enable_adaln`, `enable_visual_crossattn` →
	all off (no module = nothing to inject anyway).

	---

	## 3. Reference video-to-video (Ref V2V) — experimental

	Files (two builds of the same LoRA family — each ships as a `(.standard, .module)` pair):

	- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors`
	- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors`
	- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors`
	- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors`

	### What it is

	The goal is add / replace using a reference image — same vibe as Edit
	Anything v1.0, but with an explicit image as the appearance source instead
	of relying only on the prompt.

	Trained on ~1600 Add / Replace video pairs. Reference-paired video
	datasets are basically nonexistent, so the dataset had to be built from
	scratch — that is why the sample count is small. It often fails. This
	is fully experimental; thousands of training runs went into landing on this
	LoRA layout, and it is still unclear how much it actually helps.

	### Architecture — why this LoRA has "modules"

	Trained as a conventional IC-LoRA, plus extra projection branches that try
	to make the reference signal survive across layers:

	- `ref_visual_proj` — projects the reference VAE latent into 32 visual
	memory tokens.
	- `ref_attn` — a dedicated cross-attention branch inside each
	transformer block, reading those tokens.
	- `ref_adaln_proj` — a global AdaLN bias derived from the reference
	(palette / overall look).
	- `role_embedding` — an experimental token bias inspired by some of
	Kijai's tests; whether it actually helps is still unclear.

	These extra weights are saved alongside the LoRA in a `.module.safetensors`
	sidecar because they are not standard LoRA adapters — the regular
	ComfyUI LoRA loader can't consume them, so they need a dedicated node.

	### How to load

	| File | What it is | Where it goes |
	|---|---|---|
	| `*.standard.safetensors` | LoRA on `attn1` / `attn2` / `ff` only | Standard ComfyUI LoRA loader |
	| `*.module.safetensors` | `role_embedding`, `ref_adaln_proj`, `ref_visual_proj`, `ref_attn` LoRA adapters | `LTXVEditAnythingModuleLoader` (BFSnodes) |

	Both files of a pair must be loaded together — the LoRA was trained
	against the sidecar adapters and they only make sense as a unit. Do not mix
	`.standard` from one build with `.module` from another.
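
Because the two files of a pair share everything in their names except the `.standard` / `.module` part, a mismatch is easy to detect before loading. A minimal sketch, with a hypothetical helper name:

```python
from pathlib import Path

STANDARD_SUFFIX = ".standard.safetensors"
MODULE_SUFFIX = ".module.safetensors"

def same_build(standard_path, module_path):
    """True when both files share the name stem before .standard / .module."""
    std_name = Path(standard_path).name
    mod_name = Path(module_path).name
    if not (std_name.endswith(STANDARD_SUFFIX) and mod_name.endswith(MODULE_SUFFIX)):
        return False
    return std_name[: -len(STANDARD_SUFFIX)] == mod_name[: -len(MODULE_SUFFIX)]

base = "edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding"
print(same_build(base + ".standard.safetensors", base + ".module.safetensors"))
```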

	The module file is consumed by the **`🅛🅣🅧 LTXV Edit Anything Looping
	Sampler`** node, which was written specifically to:

	1. Install the `ref_attn` cross-attention branch on every transformer block.
	2. Inject the AdaLN / role / visual cross-attention conditioning at the
	correct points in the model.
	3. Sample long videos in overlapping chunks with the conditioning re-applied
	per chunk.
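
For intuition on point 3, overlapping chunks can be described as frame ranges that share a few frames with their neighbour. The sketch below only computes those ranges; the function name, chunk size, and overlap are illustrative assumptions, and how the sampler actually blends the overlapping frames is not covered here.

```python
def chunk_ranges(num_frames, chunk_size, overlap):
    """Return (start, end) frame ranges, end exclusive, that cover num_frames."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap
    ranges, start = [], 0
    while True:
        end = min(start + chunk_size, num_frames)
        ranges.append((start, end))
        if end == num_frames:
            return ranges
        start += stride

# Hypothetical numbers: a 20-frame clip, 8-frame chunks, 2 frames shared.
print(chunk_ranges(20, 8, 2))
```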

	### How to wire the LoopingSampler

	- Load the `*.standard.safetensors` through a normal ComfyUI LoraLoader
	before the sampler.
	- Load the `*.module.safetensors` through `LTXVEditAnythingModuleLoader`
	and connect its `editanything_module` output to the sampler.
	- On the sampler:
	  - `editanything_module` → the module loader output (required).
	  - `ref_image` → the reference image (required; this is what
	    `Add` / `Replace` will insert).
	  - `guide_frames` → the source video to edit.
	  - `enable_adaln` → on (defaults match training).
	  - `enable_visual_crossattn` → on for the 4-extras build (the one whose
	    file name includes `ref_attn-ref_visual_proj`); off for the 2-extras
	    build (`ref_adaln_proj-role_embedding` only), where it would be a
	    no-op anyway.
	  - `enable_role_embedding` → off for the 4-extras build (the training
	    config disabled it); on if you load the 2-extras build on its own.

	A missing `ref_image` disables AdaLN and the visual cross-attention
	without raising an error; the only sign is a warning in the sampler log.
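
The per-build toggles can be derived from the module file name, since the build with the visual branch carries `ref_attn` and `ref_visual_proj` in its name. The helper below is a hypothetical convenience that returns a plain dictionary to copy into the node by hand; it is not a BFSnodes API.

```python
def ref_v2v_sampler_flags(module_filename):
    """Suggested sampler toggles for a Ref V2V *.module.safetensors file."""
    if not module_filename.endswith(".module.safetensors"):
        raise ValueError("expected a *.module.safetensors sidecar")
    has_visual_branch = (
        "ref_attn" in module_filename and "ref_visual_proj" in module_filename
    )
    return {
        "enable_adaln": True,
        # Only the build that ships ref_attn + ref_visual_proj can use this.
        "enable_visual_crossattn": has_visual_branch,
        # Off for the build with the visual branch, on for the other one.
        "enable_role_embedding": not has_visual_branch,
    }

name = "edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors"
print(ref_v2v_sampler_flags(name))
```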

	### Which build to use

	- `ref_adaln_proj-role_embedding` — the original training. Only ships
	the two side-channel modules.
	- `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` — the
	continuation. Adds the visual cross-attention branch and its projector on
	top.

	It is genuinely not clear yet whether the extra branches help over the
	plain LoRA. Both builds are honest experiments. Try both, decide for your
	own use case, and please share findings.

	### Reading the layers

	For anyone who wants to understand what each layer in the Ref V2V
	checkpoint does:

	- [`lora_layers_reference.md`](./lora_layers_reference.md) — full tensor
	inventory of both builds.
	- [`lora_layers_impact.md`](./lora_layers_impact.md) — what each branch
	contributes at inference and which inference knob (`adaln_scale`,
	`ref_context_scale`, `ref_token_scale`, `ref_start_block`,
	`ref_end_block`, etc.) maps back to which training default.
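
For a quick look before opening those documents, tensor names can be grouped by the four branch names used in this README. This is a sketch under assumptions: the example tensor names are made up, and the real key layout is whatever `lora_layers_reference.md` lists.

```python
from collections import Counter

BRANCHES = ("ref_visual_proj", "ref_attn", "ref_adaln_proj", "role_embedding")

def count_by_branch(tensor_names):
    """Count tensor names per sidecar branch; everything else goes to 'other'."""
    counts = Counter()
    for name in tensor_names:
        branch = next((b for b in BRANCHES if b in name), "other")
        counts[branch] += 1
    return dict(counts)

# Hypothetical tensor names, for illustration only.
example = [
    "blocks.0.ref_attn.to_q.weight",
    "ref_visual_proj.weight",
    "blocks.0.attn1.to_k.lora_A.weight",
]
print(count_by_branch(example))
```

With the `safetensors` package installed, the names of a real checkpoint can be read without loading any weights by opening the file with `safe_open(path, framework="np")` and passing its `keys()` to the function.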

	---

	## Prompt examples

	v0.1 and Ref V2V were trained on very different caption styles (the
	no-reference multitask LoRA's caption styles are listed in section 2).
	Match the style of whichever LoRA you're using; straying outside the
	training distribution is the fastest way to get garbage out.

	### Edit Anything v0.1 — standard editing

	The stage-1 dataset uses short imperative captions describing one or two
	edits. Use the same shape at inference. Examples drawn from the training
	distribution:

	- *"Replace the stone statue of a man on the left with a young woman in a
	green dress."*
	- *"Add a black labrador retriever sitting beside the woman on the bench."*
	- *"Remove the teacher from the classroom."*
	- *"Alter the cap's colour from modern black to deep maroon."*
	- *"Replace the fresh citrus-green background with a wooden desk."*
	- *"Add faint tire tracks across the snow behind the car."*
	- *"Add a black statue, a blue camera, a cyan towel, a red guitar and a
	pink backpack to the lakeside pier."*

	Tips:

	- Imperative verbs: Add / Replace / Remove / Alter / Change.
	- When replacing, describe both the original and the new subject so the
	model can localise the edit.
	- Keep captions short and concrete. Long flowery prose hurts.

	### Edit Anything v0.1 — motion transfer

	Workflow:

	1. Pick a guide video.
	2. Edit only the first frame externally (Flux Klein or any
	capable image-edit model) to introduce the new subject in the desired
	pose and position.
	3. Feed the edited frame as the first frame of the input and the original
	guide as motion source.
	4. The prompt should describe **the inserted subject and the action being
	preserved**.

	Examples:

	- *"Replace the standing man holding the umbrella with a woman in a red
	coat holding the same umbrella, walking across the puddles."*
	- *"Add a tabby cat curled up in the armchair while the man in the
	background keeps reading."*
	- *"Replace the runner in the blue jersey with a man wearing a white shirt
	and grey shorts running along the same path."*

	Limits: fast or chaotic motion will fail; the inserted subject should
	occupy roughly the same region/scale as what it replaces.

	### Reference V2V (Ref V2V) — Add and Replace

	These captions are real samples from the ~1600-pair training set. They
	describe the target scene after the edit in detail. The reference
	image carries the appearance of the inserted subject; the caption
	carries position, pose, action, and surrounding context.

	Add task (the reference image holds the new subject):

	- *"Add a middle-aged man with curly grey hair, a beard and glasses,
	wearing a blue quarter-zip sweater, on the right side of the frame,
	standing in front of a raw cut of meat on a tray."*
	- *"Add a light-coloured small boat with dark seats and an outboard motor
	floating in the water."*
	- *"Add an open book filled with colourful pencils in the woman's hands."*
	- *"Add a silver metallic bucket on the table in front of the blonde
	character, with her hands stirring a mixture inside."*
	- *"Add two miniature dolls, one blonde and one brunette, dressed in
	patterned clothing, sitting at a small table with teacups and small
	white vases on the countertop."*

	Replace task (the reference image holds the new subject; the caption
	also describes what is being replaced):

	- *"Replace the standing kangaroo holding the bicycle handlebars with a
	man wearing a white t-shirt, light brown shorts and a yellow cap,
	holding the bicycle handlebars."*
	- *"Replace the stone statue of a man on the left side with a young woman
	in a green dress."*
	- *"Replace the wooden barrel near the entrance with a large brown leather
	suitcase."*

	Tips for Ref V2V:

	- Describe the inserted subject in full, even though the reference
	image is the source of truth — the text path drives placement and pose.
	- For Replace, also describe what is being replaced so the model can
	match the spatial region.
	- Keep the inserted subject roughly in the same scale and region as what
	it replaces.
	- The captions in the training set average ~25–40 words — aim for that
	range. Single-sentence captions like "Add a man" are far too sparse
	and will fail.

	---

	## Inference tips (applies to all models)

	CFG matters a lot here. The default workflow runs with the LTX-2.3
	distilled / acceleration LoRAs for fast 4–8 step sampling, which
	locks CFG at 1.0. That's fine for casual runs, but at CFG 1 the model
	follows the prompt loosely: you get the reference image to "show up"
	but the edit instruction itself is only weakly enforced.

	For harder prompts, raise CFG above 1.0. This means dropping (or
	weakening) the distilled / acceleration LoRAs and going back to a normal
	sampler with more steps. The trade-off is speed for prompt adherence:
	sampling is significantly slower, but the model follows the prompt much
	more closely.

	Other knobs:

	- If the model is ignoring the prompt (edit isn't being applied, the
	reference is barely showing up, the style transfer is faint), raising
	CFG is the single most common fix. Go up to 6–8 if needed.
	- If results look over-saturated, plasticky, or motion is freezing,
	CFG is too high — pull back toward 3–4 or re-enable the distilled LoRA
	for CFG 1 if you don't actually need stronger prompt adherence.
	- Ref V2V in particular benefits from being more aggressive with CFG when
	the reference identity isn't transferring cleanly.
	- Combine CFG tuning with the LoRA-specific knobs from each section
	(`adaln_scale`, `ref_context_scale`, `ref_token_scale` for Ref V2V;
	prompt rewriting for v0.1 / no-ref).

	Treat CFG as a real knob, not a constant — and be ready to give up some
	speed when you actually need the edit to land.
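
The reason CFG 1 enforces the prompt only weakly follows from the standard classifier-free guidance formula, sketched below on toy numbers. The arrays stand in for model predictions and are made up; the assumption is that the sampler combines the conditional and unconditional predictions in this usual way.

```python
import numpy as np

def apply_cfg(cond, uncond, cfg_scale):
    """Classifier-free guidance: extrapolate away from the unconditional prediction."""
    return uncond + cfg_scale * (cond - uncond)

cond = np.array([1.0, 2.0, 3.0])    # toy prediction with the prompt
uncond = np.array([0.5, 2.0, 2.0])  # toy prediction without the prompt

# At CFG 1 the result is exactly the conditional prediction, with no extra push.
assert np.allclose(apply_cfg(cond, uncond, 1.0), cond)

# Above 1 the prompt-dependent difference is amplified.
print(apply_cfg(cond, uncond, 4.0))
```

Running with CFG above 1 also needs both predictions at every step, which is part of why it is slower than the distilled CFG 1 path.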

	---

	## ComfyUI nodes

	All recommended inference paths run through the BFSnodes custom node
	set. For now BFSnodes is the only place these nodes live; once they
	stabilise they may move elsewhere.

	Specific nodes used by these LoRAs:

	- `LTXVEditAnythingApply` — load the LoRA + extras and patch the model.
	- `🅛🅣🅧 LTXV Edit Anything Looping Sampler` — sampler that injects role /
	AdaLN / visual cross-attention and handles long videos in chunks.
	- `LTXVEditAnythingModuleLoader` — load the `*.module.safetensors` sidecar.

	---

	## Status

	Released as experimental research artefacts. Expect failures, do not
	deploy, and please report what works and what doesn't.

	---

	## Credits

	If you use these models — in a project, a demo, a paper, a video, a tweet,
	a workflow, anything — please credit my work. These checkpoints are the
	result of weeks of research, dataset building, and training runs, and that
	effort is what makes any of it usable. Crediting the source is the bare
	minimum that keeps open research like this sustainable.

	Author: Alisson Pereira dos Anjos ([@Alissonerdx](https://huggingface.co/Alissonerdx))

	Suggested attribution:

	> Edit Anything LoRAs by Alisson Pereira dos Anjos
	> ([huggingface.co/Alissonerdx/EditAnything](https://huggingface.co/Alissonerdx/EditAnything)).

	Links back to this repository are appreciated wherever you publish results.