Commit dbe9157 by Alissonerdx · verified · 1 Parent(s): df0e628

Upload folder using huggingface_hub

Files changed (1): README.md +318 -0
---
license: apache-2.0
library_name: diffusers
base_model: Lightricks/LTX-2.3
tags:
- lora
- video
- video-editing
- ltxv
- ltx-2.3
---

# Edit Anything: Experimental LTX-2 Video Editing LoRAs

> **Heads up.** These LoRAs are research experiments. They are far from
> production-ready and will fail on many inputs. They are released for the
> community to play with and break, not as a finished tool.

This repository hosts two unrelated training tracks built on top of
**LTX-2.3 (22B)** for video editing:

1. **Edit Anything v1.1: motion transfer LoRA** (two ranks).
2. **Reference video-to-video (Ref V2V): experimental IC-LoRA + sidecar modules** (two builds).

Inference is meant to run through the **BFSnodes** ComfyUI custom nodes.
The Ref V2V builds in particular need them to load the sidecar modules and
install the custom branches into the transformer.

---

## 1. Edit Anything v1.1 (motion transfer)

Files:

- `edit_anything_30k_v1.1_motion_transfer_r128.safetensors`
- `edit_anything_30k_v1.1_motion_transfer_r256.safetensors`

### What it is

**v1.1 is not a direct continuation of v1.0.** It was trained from scratch
in two stages:

1. **Stage 1: image-only pretraining.** ~30 000 image edit pairs. Training
   a *video* model on still images is admittedly not ideal, but it was a way
   to push the editing vocabulary beyond what a small video-only dataset can
   teach.
2. **Stage 2: video fine-tune with `first_frame_conditioning > 0`.** This
   restored the temporal prior and unlocked the motion-transfer behaviour
   described below.

In theory v1.1 can do the same edits as v1.0, but **temporal consistency may
be weaker than v1.0** because so much of the training (all of stage 1)
happened on still images. Test against v1.0 case-by-case before assuming
v1.1 wins on your task.

### Motion transfer

Because stage 2 included first-frame conditioning, you can drive the LoRA
into a motion-transfer mode:

1. Take a guide video.
2. **Replace its first frame** with an edited still (insert a new subject,
   swap an object, etc.). Use a strong image-editing model (Flux Kontext /
   "Klein" or similar) to prepare it; the quality of this single frame
   propagates through the whole clip.
3. Feed the edited frame as the first frame of the input, and the original
   guide video as the motion source.

The model uses the new first frame as the appearance anchor and copies the
motion from the rest of the guide.

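The three steps above can be sketched as plain data preparation. This is a minimal illustration, not the BFSnodes implementation: the function name `build_motion_transfer_inputs` is hypothetical, and it assumes the input clip is the guide with only its first frame swapped, which is one reading of step 2.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")  # any frame representation: array, image object, path


def build_motion_transfer_inputs(
    guide_frames: Sequence[Frame], edited_first_frame: Frame
) -> Tuple[List[Frame], List[Frame]]:
    """Return (input_frames, motion_source) for a motion-transfer run.

    Assumption: the input clip is the guide video with only frame 0
    replaced by the externally edited still; the untouched guide is
    passed separately as the motion source.
    """
    if len(guide_frames) == 0:
        raise ValueError("guide video has no frames")
    # Frame 0 is the appearance anchor; the remaining frames come from the guide.
    input_frames = [edited_first_frame, *guide_frames[1:]]
    # The original guide is kept unmodified as the motion reference.
    motion_source = list(guide_frames)
    return input_frames, motion_source
```

In practice the edited still should match the guide's resolution and framing, since any mismatch in that single frame would carry through the clip.
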
Limitations:

- Fast or chaotic motion → fails.
- Poor blending / artefacts in the first frame propagate everywhere.
- Works best when the inserted subject roughly occupies the same region as
  whatever it replaces.

### Prompting

The prompt is just as critical as in v1.0. **Describe both the object being
replaced and the new one in detail**. Example: *"Replace the bronze statue on
the left with a tall man wearing a navy raincoat and brown boots."* Vague
prompts produce bad edits.

### Which rank to use

The same training produced both files. v1.1 is actually the merge of the
two-stage training (one LoRA per stage), re-extracted at two different ranks
via Frobenius-optimal truncated SVD:

| File | Rank | Size | Frobenius retention |
|---|---|---|---|
| `edit_anything_30k_v1.1_motion_transfer_r128.safetensors` | 128 | 1.31 GB | ~99.4% |
| `edit_anything_30k_v1.1_motion_transfer_r256.safetensors` | 256 | 2.62 GB | ~99.9% |

r256 is closer to the merged source. r128 is normally indistinguishable in
practice. Pick whichever fits your workflow.

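For readers unfamiliar with the re-extraction step, here is a minimal NumPy sketch for a single weight matrix. It is not the script used for these files. It assumes each stage's LoRA is a pair `(B, A)` with update `B @ A`, ignores any alpha/rank scaling, and assumes "Frobenius retention" means the ratio of Frobenius norms between the truncated and the merged update. The function name `merge_and_reextract` is hypothetical.

```python
import numpy as np


def merge_and_reextract(pairs, rank):
    """Merge LoRA updates for one layer and re-extract them at a given rank.

    pairs: list of (B, A) with B of shape (out, r_i) and A of shape (r_i, in).
    Returns (B_new, A_new, retention), with retention in [0, 1].
    """
    # Merged dense update: the sum of the per-stage low-rank updates.
    delta = sum(B @ A for B, A in pairs)
    # Truncated SVD is the best rank-r approximation in Frobenius norm
    # (Eckart-Young theorem).
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    r = min(rank, S.size)
    # Split the singular values evenly between the two factors.
    root = np.sqrt(S[:r])
    B_new = U[:, :r] * root
    A_new = root[:, None] * Vt[:r]
    # Assumed definition: ||delta_r||_F / ||delta||_F.
    retention = float(np.sqrt((S[:r] ** 2).sum() / (S ** 2).sum()))
    return B_new, A_new, retention
```

Under this definition, retention reaches 1.0 as soon as the target rank covers the rank of the merged update, which is consistent with r256 sitting closer to the merged source than r128.
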
---

## 2. Reference video-to-video (Ref V2V): experimental

Files (two builds of the same LoRA family; each ships as a `(.standard, .module)` pair):

- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.standard.safetensors`
- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding.module.safetensors`
- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.standard.safetensors`
- `edit_anything_reference_v0.1_r128_ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj.module.safetensors`

### What it is

The goal is **add / replace using a reference image**: same vibe as Edit
Anything v1.0, but with an explicit image as the appearance source instead
of relying only on the prompt.

Trained on **~1600** Add / Replace video pairs. Reference-paired video
datasets are basically nonexistent, so the dataset had to be built from
scratch, which is why the sample count is small. **It often fails.** This
is fully experimental; thousands of training runs went into landing on this
LoRA layout, and it is still unclear how much it actually helps.

### Architecture: why this LoRA has "modules"

Trained as a conventional IC-LoRA, plus extra projection branches that try
to make the reference signal survive across layers:

- **`ref_visual_proj`**: projects the reference VAE latent into 32 visual
  memory tokens.
- **`ref_attn`**: a dedicated cross-attention branch inside each
  transformer block, reading those tokens.
- **`ref_adaln_proj`**: a global AdaLN bias derived from the reference
  (palette / overall look).
- **`role_embedding`**: an experimental token bias inspired by some of
  Kijai's tests; whether it actually helps is still unclear.

These extra weights are saved alongside the LoRA in a `.module.safetensors`
sidecar because they are **not standard LoRA adapters**: the regular
ComfyUI LoRA loader can't consume them, so they need a dedicated node.

### How to load

| File | What it is | Where it goes |
|---|---|---|
| `*.standard.safetensors` | LoRA on `attn1` / `attn2` / `ff` only | Standard ComfyUI LoRA loader |
| `*.module.safetensors` | `role_embedding`, `ref_adaln_proj`, `ref_visual_proj`, `ref_attn` LoRA adapters | `LTXVEditAnythingModuleLoader` (BFSnodes) |

Both files of a pair must be loaded **together**: the LoRA was trained
against the sidecar adapters and they only make sense as a unit. Do not mix
`.standard` from one build with `.module` from another.

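A quick way to avoid mixing builds is to compare the part of the filename that precedes the `.standard` / `.module` suffix. The sketch below relies only on the naming convention of the files in this repository; the helper names `build_name` and `is_matching_pair` are hypothetical and not part of BFSnodes.

```python
from pathlib import PurePath

STANDARD_SUFFIX = ".standard.safetensors"
MODULE_SUFFIX = ".module.safetensors"


def build_name(filename: str, suffix: str) -> str:
    """Strip the given suffix and return the build identifier."""
    name = PurePath(filename).name
    if not name.endswith(suffix):
        raise ValueError(f"{name!r} does not end with {suffix!r}")
    return name[: -len(suffix)]


def is_matching_pair(standard_file: str, module_file: str) -> bool:
    """True when both files belong to the same Ref V2V build."""
    return build_name(standard_file, STANDARD_SUFFIX) == build_name(
        module_file, MODULE_SUFFIX
    )
```

For example, pairing the `ref_adaln_proj-role_embedding` standard file with the `ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj` module file returns `False`.
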
Once loaded through `LTXVEditAnythingModuleLoader`, the module file is
consumed by the **`🅛🅣🅧 LTXV Edit Anything Looping Sampler`** node, which was
written specifically to:

1. Install the `ref_attn` cross-attention branch on every transformer block.
2. Inject the AdaLN / role / visual cross-attention conditioning at the
   correct points in the model.
3. Sample long videos in overlapping chunks with the conditioning re-applied
   per chunk.

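To make the overlapping-chunk idea in step 3 concrete, here is a generic sliding-window sketch. It is not the sampler's actual code: `chunk_windows`, `chunk_size`, and `overlap` are hypothetical names, and the real node may pick its windows differently.

```python
from typing import List, Tuple


def chunk_windows(total_frames: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame windows that cover the whole clip.

    Consecutive windows share `overlap` frames, except that the last window
    is clipped to the end of the clip.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if total_frames <= 0:
        return []
    stride = chunk_size - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        end = min(start + chunk_size, total_frames)
        windows.append((start, end))
        if end == total_frames:
            break
        start += stride
    return windows
```

Each window would then be sampled with the reference conditioning applied again, and the shared frames give the next chunk something to stay consistent with.
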
### Which build to use

- **`ref_adaln_proj-role_embedding`**: the original training. Only ships
  the two side-channel modules.
- **`ref_adaln_proj-role_embedding-ref_attn-ref_visual_proj`**: the
  continuation. Adds the visual cross-attention branch and its projector on
  top.

It is genuinely **not clear yet** whether the extra branches help over the
plain LoRA. Both builds are honest experiments. Try both, decide for your
own use case, and please share findings.

### Reading the layers

For anyone who wants to understand what each layer in the Ref V2V
checkpoint does:

- [`lora_layers_reference.md`](./lora_layers_reference.md): full tensor
  inventory of both builds.
- [`lora_layers_impact.md`](./lora_layers_impact.md): what each branch
  contributes at inference and which inference knob (`adaln_scale`,
  `ref_context_scale`, `ref_token_scale`, `ref_start_block`,
  `ref_end_block`, etc.) maps back to which training default.

---

## Prompt examples

The two LoRAs were trained on very different caption styles. Match the
style of whichever LoRA you're using; straying outside the training
distribution is the fastest way to get garbage out.

### Edit Anything v1.1: standard editing

The stage-1 dataset uses short imperative captions describing one or two
edits. Use the same shape at inference. Examples drawn from the training
distribution:

- *"Replace the stone statue of a man on the left with a young woman in a
  green dress."*
- *"Add a black labrador retriever sitting beside the woman on the bench."*
- *"Remove the teacher from the classroom."*
- *"Alter the cap's colour from modern black to deep maroon."*
- *"Replace the fresh citrus-green background with a wooden desk."*
- *"Add faint tire tracks across the snow behind the car."*
- *"Add a black statue, a blue camera, a cyan towel, a red guitar and a
  pink backpack to the lakeside pier."*

Tips:

- Imperative verbs: **Add / Replace / Remove / Alter / Change**.
- When replacing, **describe both** the original and the new subject so the
  model can localise the edit.
- Keep captions short and concrete. Long flowery prose hurts.

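The first two tips can be turned into a small pre-flight check on a caption. This is only an illustrative helper, not something the LoRA or BFSnodes provides; the names `leading_edit_verb` and `lint_caption` are hypothetical, and the "Replace X with Y" test is a rough heuristic based on the example captions above.

```python
from typing import List, Optional

EDIT_VERBS = ("Add", "Replace", "Remove", "Alter", "Change")


def leading_edit_verb(caption: str) -> Optional[str]:
    """Return the imperative verb the caption starts with, or None."""
    words = caption.strip().split()
    if not words:
        return None
    # Drop quote/emphasis characters and normalise the case before comparing.
    first = words[0].strip("\"'*").capitalize()
    return first if first in EDIT_VERBS else None


def lint_caption(caption: str) -> List[str]:
    """Return a list of warnings; an empty list means no issue was found."""
    warnings = []
    verb = leading_edit_verb(caption)
    if verb is None:
        warnings.append("caption should start with one of: " + ", ".join(EDIT_VERBS))
    elif verb == "Replace" and " with " not in caption:
        warnings.append("Replace captions should name both subjects: 'Replace X with Y'")
    return warnings
```
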
### Edit Anything v1.1: motion transfer

Workflow:

1. Pick a guide video.
2. Edit **only the first frame** externally (Flux Kontext / "Klein", InstructPix2Pix, etc.)
   to introduce the new subject in the desired pose and position.
3. Feed the edited frame as the first frame of the input and the original
   guide as the motion source.
4. The prompt should describe **the inserted subject and the action being
   preserved**.

Examples:

- *"Replace the standing man holding the umbrella with a woman in a red
  coat holding the same umbrella, walking across the puddles."*
- *"Add a tabby cat curled up in the armchair while the man in the
  background keeps reading."*
- *"Replace the runner in the blue jersey with a man wearing a white shirt
  and grey shorts running along the same path."*

Limits: fast or chaotic motion will fail; the inserted subject should
occupy roughly the same region/scale as what it replaces.

### Reference V2V (Ref V2V): Add and Replace

These captions are real samples from the ~1600-pair training set. They
describe the **target scene after the edit** in detail. The reference
image carries the *appearance* of the inserted subject; the caption
carries *position, pose, action, and surrounding context*.

**Add task** (the reference image holds the new subject):

- *"Add a middle-aged man with curly grey hair, a beard and glasses,
  wearing a blue quarter-zip sweater, on the right side of the frame,
  standing in front of a raw cut of meat on a tray."*
- *"Add a light-coloured small boat with dark seats and an outboard motor
  floating in the water."*
- *"Add an open book filled with colourful pencils in the woman's hands."*
- *"Add a silver metallic bucket on the table in front of the blonde
  character, with her hands stirring a mixture inside."*
- *"Add two miniature dolls, one blonde and one brunette, dressed in
  patterned clothing, sitting at a small table with teacups and small
  white vases on the countertop."*

**Replace task** (the reference image holds the new subject; the caption
also describes what is being replaced):

- *"Replace the standing kangaroo holding the bicycle handlebars with a
  man wearing a white t-shirt, light brown shorts and a yellow cap,
  holding the bicycle handlebars."*
- *"Replace the stone statue of a man on the left side with a young woman
  in a green dress."*
- *"Replace the wooden barrel near the entrance with a large brown leather
  suitcase."*

Tips for Ref V2V:

- **Describe the inserted subject in full**, even though the reference
  image is the source of truth; the text path drives placement and pose.
- For *Replace*, **also describe what is being replaced** so the model can
  match the spatial region.
- Keep the inserted subject roughly in the same scale and region as what
  it replaces.
- The captions in the training set average ~25–40 words, so aim for that
  range. Bare captions like *"Add a man"* are far too sparse and will fail.

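The length guidance in the last tip is easy to check before queueing a run. A minimal sketch, assuming whitespace-separated words are a good enough count; the function names and the default bounds (taken from the ~25–40 word average quoted above) are illustrative, and several of the real training captions listed here are shorter than 25 words, so treat the result as advisory.

```python
def caption_word_count(caption: str) -> int:
    """Count whitespace-separated words in a caption."""
    return len(caption.split())


def in_training_range(caption: str, low: int = 25, high: int = 40) -> bool:
    """True when the caption length falls inside the assumed typical range."""
    return low <= caption_word_count(caption) <= high


example = (
    "Add a middle-aged man with curly grey hair, a beard and glasses, "
    "wearing a blue quarter-zip sweater, on the right side of the frame, "
    "standing in front of a raw cut of meat on a tray."
)
print(caption_word_count(example), in_training_range(example))
print(caption_word_count("Add a man"), in_training_range("Add a man"))
```
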
---

## ComfyUI nodes

All recommended inference paths run through the **BFSnodes** custom node
set. For now BFSnodes is the only place these nodes live; once they
stabilise they may move elsewhere.

Specific nodes used by these LoRAs:

- `🅛🅣🅧 LTXV Edit Anything Looping Sampler`: sampler that injects role /
  AdaLN / visual cross-attention and handles long videos in chunks.
- `LTXVEditAnythingModuleLoader`: loads the `*.module.safetensors` sidecar.

---

## Status

Released as experimental research artefacts. Expect failures, do not
deploy, and please report what works and what doesn't.

---

## Credits

If you use these models (in a project, a demo, a paper, a video, a tweet,
a workflow, anything), **please credit my work**. These checkpoints are the
result of weeks of research, dataset building, and training runs, and that
effort is what makes any of it usable. Crediting the source is the bare
minimum that keeps open research like this sustainable.

**Author:** Alisson Pereira dos Anjos ([@Alissonerdx](https://huggingface.co/Alissonerdx))

Suggested attribution:

> Edit Anything LoRAs by Alisson Pereira dos Anjos
> ([huggingface.co/Alissonerdx/EditAnything](https://huggingface.co/Alissonerdx/EditAnything)).

Links back to this repository are appreciated wherever you publish results.