SyFeee committed on
Commit
23ed3c5
·
verified ·
1 Parent(s): 480ecfd

docs: correct IC-LoRA mislabeling — empirical A/B/C test shows mechanism is first-frame i2v pin, not parallel-canvas IC-LoRA (credit ZKong)

  1. README.md +81 -111
README.md CHANGED
@@ -5,21 +5,24 @@ base_model:
5
  tags:
6
  - video-generation
7
  - lora
8
- - ic-lora
9
  - ltx-video
10
  - dual-character
11
  - dialogue
12
  - cinematic
13
  - chinese-drama
 
14
  pipeline_tag: image-to-video
15
  language:
16
  - en
17
  - zh
18
  ---
19
 
20
- # LTX-Video 2.3 IC-LoRA: Dual-Character (English mirror)
21
 
22
- An English-mirrored, field-tested **In-Context LoRA** for `Lightricks/LTX-2.3` (22B distilled), tuned for two-character dialogue scenes and multi-shot cinematic video generation.
 
 
 
23
 
24
  ---
25
 
@@ -43,80 +46,85 @@ Episode is an 8-shot Chinese palace drama (《玉佩定情》 + 《暗夜阴谋
43
 
44
  ## What this LoRA does
45
 
46
- An **In-Context LoRA (IC-LoRA)** trained on top of `Lightricks/LTX-2.3` (22B distilled), specifically tuned for:
47
 
48
  1. **Two-character dialogue scenes** — significantly reduces character drift when two people appear in the same frame
49
  2. **Cinematic shot composition** — reinforced for dialogue-driven framing (close-up ↔ medium ↔ wide)
50
  3. **Multi-shot narrative continuity** — better understanding of multi-segment prompts (storyboard-style descriptions)
51
  4. **Style compatibility** — works well across 古风仙侠 (ancient Chinese fantasy), 现代都市 (modern urban), and 3D 动漫 styles
52
 
53
- This is an **IC-LoRA** (in-context LoRA), so it expects reference images to be passed through the parallel-canvas conditioning mechanism, NOT as pixel-pinned frames. See the [Lightricks `ltx_pipelines.ic_lora.ICLoraPipeline`](https://github.com/Lightricks/LTX-2) for the upstream pipeline.
54
-
55
- ---
56
-
57
- ## Model card
58
-
59
- | Field | Value |
60
- |---|---|
61
- | Base model | `Lightricks/LTX-2.3` (22B distilled) |
62
- | LoRA type | IC-LoRA (video-to-video conditioning) |
63
- | File | `LTX2.3-IC-LORA-Dual-Character.safetensors` (~313 MB) |
64
- | License | Apache 2.0 |
65
- | Trigger word | None — no special token required |
66
 
67
  ---
68
 
69
- ## Field-tested production usage
70
-
71
- The notes below are from running this LoRA in production as part of a multi-shot Chinese drama video generation pipeline. They go beyond what's in the original model card.
72
-
73
- ### Strength
74
 
75
- - **Standalone:** 0.7–0.9 works well
76
- - **When stacking with other LoRAs:** drop to 0.3–0.5 to stay under the typical 1.5 over-baking ceiling
77
 
78
- ### Resolution
 
 
 
 
 
79
 
80
- - **Recommended:** 1280×704 (16:9, native LTX-2.3 distilled training resolution)
81
- - **Faster preview:** 960×544 (~40% faster, slightly less detail)
82
- - **Avoid portrait (9:16)** — this LoRA was trained on landscape; identity quality degrades noticeably in portrait orientation
83
- - Width and height must each be divisible by 64
 
84
 
85
- ### Number of frames
 
 
 
 
 
 
86
 
87
- LTX-2.3 requires `num_frames` to satisfy `8k + 1` (e.g., 121, 145, 193, 241, 361). At 24 fps:
88
- - 5 s shot = 121 frames
89
- - 8 s shot = 193 frames
90
- - 15 s shot = 361 frames
 
 
 
 
 
 
 
 
 
 
91
 
92
- ### Prompt structure that works well
93
 
94
- Use a 3-block structure: `[场景] / [角色] / [镜头与情节]`.
95
 
96
- ```text
97
- [场景] 古风皇宫御花园桃花径,午后金色阳光透过盛开桃花斜射,
98
- 粉色花瓣随风飘落,朱红宫墙翠竹环绕。
99
 
100
- [角色] 沈月华:年轻女子,长黑发半扎绿玉簪,鬓边一朵小白花,
101
- 柔和圆润大眼,肤色白皙。身穿浅蓝色丝绸汉服宫装,白色云鹤刺绣。
102
- 萧云霄:年轻男子,黑发束起金冠玉饰,剑眉星目。身穿深红色丝绸金线龙纹宫袍。
103
 
104
- [镜头与情节] 中景两人画面。沈月华手中持翠绿玉佩,萧云霄从右侧朱红
105
- 宫墙转角缓步走出停下,拱手轻施一礼,目光温和注视沈月华手中玉佩。
106
- 电影级布光,浅景深虚化,35mm 双人中景,温暖色调。
107
- ```
108
 
109
- ### Production tips earned the hard way
 
 
 
 
 
 
 
110
 
111
- These are quirks of this LoRA + the LTX-2.3 distilled backbone that aren't documented in the original model card but matter in practice:
112
 
113
- #### 1. Static-image reference: use a SHORT video wrap (≤ 8 frames)
114
 
115
- If you wrap a single PNG character ref into a video for IC-LoRA conditioning, **use 8 frames (≈ 0.33 s) NOT 30 frames**. Longer static wraps cause a "first second stuck on the ref image" beat at the start of every clip. The IC-LoRA's per-frame matching dominates motion onset when the static wrap is too long.
116
 
117
- #### 2. Repeat color tokens for dark-clothed characters
118
 
119
- This LoRA has a light-wuxia-robe bias. Dark outfits drift toward white at low ref-image-strength. Recipe: **repeat the color token glued to each clothing noun**:
120
 
121
  ```text
122
  BAD: black fedora and black suit
@@ -124,55 +132,53 @@ GOOD: BLACK fedora, white shirt, BLACK suit jacket, BLACK trousers,
124
  ... BLACK suit, BLACK trousers throughout
125
  ```
126
 
127
- Also bump `ref_image_strength` to 0.55 (action) or 0.85 (medium-slow) for color fidelity.
128
 
129
- #### 3. **Never use quoted dialogue in prompts**
130
 
131
- This LoRA was trained on Chinese drama clips with burned-in Chinese subtitles. **Any quoted dialogue (`「…」` or `"…"`) in the prompt causes the LoRA to hallucinate subtitle characters at the bottom of the frame.** This is the single biggest gotcha.
132
 
133
  ```text
134
- BAD: 低声警告 「此茶不可饮!」 ← produces fake on-screen subtitles
135
  GOOD: 低声急切警告她茶水有毒 ← clean output, indirect narration
136
  ```
137
 
138
- If your application needs actual subtitles, burn them post-hoc via `ffmpeg drawtext`, not via the prompt.
139
 
140
- #### 4. Avoid "object detaches" prompts during action
141
 
142
- At high motion intensity (cas/ris ≤ 0.40), the model loses object tracking. A directive like "fedora flies off mid-spin and tumbles to the floor" produces broken output — the hat dematerialises. Either:
143
  - Keep the object attached and say so explicitly ("the fedora STAYS ON his head throughout the spin")
144
  - Or render attach + detach as two clips and concat
145
 
146
- #### 5. Cross-shot identity drift
147
 
148
- For multi-shot dialogue scenes, character identity drifts across cuts. Workaround: chain shots by passing a **12-frame tail clip** of shot N as `ref_videos=[(tail.mp4, 0.7)]` for shot N+1. Significantly improves continuity.
149
 
150
  ### Render performance
151
 
152
- - **Resolution:** 1280×704, 121 frames @ 24 fps (~5 s output)
153
- - **Hardware:** NVIDIA A800 80 GB
154
- - **Time:** ~70 s per shot (8-step distilled + 3-step spatial upscaler + audio decode)
155
  - **Output:** mp4 with ambient audio track (no TTS)
156
 
157
- On consumer hardware (RTX 4090 24 GB), expect ~34 minutes per shot due to memory pressure from the 22B model.
158
 
159
  ---
160
 
161
  ## Limitations
162
 
163
- From the original author + our field testing:
164
-
165
- 1. **Subtitle hallucination** with quoted dialogue (see tip #3 above)
166
  2. **Complex physical interactions** (wrestling, hugging, intricate hand-on-hand) can deform
167
- 3. **Tail-frame artifact** of LTX-2.3 — last 68 frames may smear; trim post-hoc if needed
168
  4. **Action complexity ceiling** — the 8-step distilled budget caps motion complexity at action peaks
169
  5. **Portrait orientation** degrades identity (LoRA trained on landscape only)
 
170
 
171
  ---
172
 
173
  ## Original Chinese README (preserved)
174
 
175
- The original Chinese model card from ModelScope is reproduced below for users who want the unmodified original documentation.
176
 
177
  <details>
178
  <summary>点击展开原版中文模型卡片 (click to expand original Chinese README)</summary>
@@ -226,49 +232,12 @@ The original Chinese model card from ModelScope is reproduced below for users wh
226
 
227
  ---
228
 
229
- ## How to use
230
-
231
- ### With the upstream Lightricks pipeline
232
-
233
- ```python
234
- from ltx_pipelines.ic_lora import ICLoraPipeline
235
- from ltx_core.loader import LoraPathStrengthAndSDOps
236
- from ltx_core.loader import sd_ops as _sd_ops_mod
237
- import torch
238
-
239
- # Use the IC-LoRA's standard SDOps mapping
240
- lora = LoraPathStrengthAndSDOps(
241
- "LTX2.3-IC-LORA-Dual-Character.safetensors",
242
- 0.8, # strength (standalone)
243
- _sd_ops_mod.LTXV_LORA_COMFY_RENAMING_MAP,
244
- )
245
-
246
- pipe = ICLoraPipeline(
247
- distilled_checkpoint_path="ltx-2.3-22b-distilled-1.1.safetensors",
248
- spatial_upsampler_path="ltx-2.3-spatial-upscaler-x2-1.1.safetensors",
249
- gemma_root="google/gemma-3-12b-it-qat-q4_0-unquantized",
250
- loras=[lora],
251
- device=torch.device("cuda:0"),
252
- )
253
-
254
- video, audio = pipe(
255
- prompt="...", # your structured 3-block prompt
256
- seed=42,
257
- height=704, width=1280,
258
- num_frames=121, # 5 s @ 24 fps, satisfies 8k+1
259
- frame_rate=24,
260
- video_conditioning=[("char_ref.mp4", 0.85)], # 8-frame static wrap of the character portrait
261
- enhance_prompt=False,
262
- conditioning_attention_strength=0.85,
263
- )
264
- ```
265
-
266
- ### Hardware requirements
267
 
268
  | GPU | VRAM | Works? |
269
  |---|---|---|
270
  | A100 / A800 80 GB | 80 GB | ✅ ~70 s per 5 s shot |
271
- | RTX 4090 / 3090 | 24 GB | ✅ ~34 min per 5 s shot |
272
  | RTX 4080 / 4070 Ti Super | 16 GB | ❌ won't fit 22B in bf16 |
273
  | anything < 24 GB | — | ❌ no |
274
 
@@ -277,13 +246,14 @@ video, audio = pipe(
277
  ## Acknowledgements
278
 
279
  - **麻雀 AI (Maque AI)** — original author of this LoRA, [original ModelScope repository](https://www.modelscope.cn/models/fxj1131/LTX2.3-IC-LORA-Dual-Character)
280
- - **[Lightricks](https://www.lightricks.com/)** — for the LTX-Video 2.3 base model and the IC-LoRA framework
 
281
 
282
  ---
283
 
284
  ## Source attribution
285
 
286
- > ⚠️ **This is an English-language mirror of [fxj1131's LTX2.3 IC-LoRA Dual-Character on ModelScope](https://www.modelscope.cn/models/fxj1131/LTX2.3-IC-LORA-Dual-Character).**
287
  > All credit for the model weights belongs to the original author, **麻雀 AI (Maque AI)**.
288
  > This mirror exists to make the model + documentation accessible to HuggingFace users who cannot easily access ModelScope, and to share field-tested usage notes from a production deployment.
289
  > **The `.safetensors` weights file is unmodified and byte-identical to the ModelScope upload.**
 
5
  tags:
6
  - video-generation
7
  - lora
 
8
  - ltx-video
9
  - dual-character
10
  - dialogue
11
  - cinematic
12
  - chinese-drama
13
+ - image-to-video
14
  pipeline_tag: image-to-video
15
  language:
16
  - en
17
  - zh
18
  ---
19
 
20
+ # LTX-Video 2.3 Dual-Character LoRA (English mirror)
21
 
22
+ A field-tested **image-to-video character-consistency LoRA** for `Lightricks/LTX-2.3` (22B distilled), tuned for two-character dialogue scenes and multi-shot cinematic video generation.
23
+
24
+ > ⚠️ **Naming note (corrected 2026-05-21):**
25
+ > The original filename and ModelScope repo include the string "IC-LORA", but **this is NOT an IC-LoRA** in the strict technical sense (parallel-canvas / `video_conditioning` mechanism). An A/B/C test (same prompt + seed, three reference-channel variants) confirmed that the LoRA's actual conditioning mechanism is **first-frame pixel pinning** (the regular i2v path), not parallel-canvas attention. Earlier copy on this card incorrectly described it as IC-LoRA — that has been removed. Credit to ZKong for raising the discrepancy in the discussions tab.
26
 
27
  ---
28
 
 
46
 
47
  ## What this LoRA does
48
 
49
+ Fine-tuned on `Lightricks/LTX-2.3` (22B distilled), specifically for:
50
 
51
  1. **Two-character dialogue scenes** — significantly reduces character drift when two people appear in the same frame
52
  2. **Cinematic shot composition** — reinforced for dialogue-driven framing (close-up ↔ medium ↔ wide)
53
  3. **Multi-shot narrative continuity** — better understanding of multi-segment prompts (storyboard-style descriptions)
54
  4. **Style compatibility** — works well across 古风仙侠 (ancient Chinese fantasy), 现代都市 (modern urban), and 3D 动漫 styles
55
 
56
+ The reference image is consumed via a **first-frame pixel pin** (standard i2v conditioning), not via the parallel-canvas / `video_conditioning` channel.
57
 
58
  ---
59
 
60
+ ## How to use (correct pattern)
 
 
 
 
61
 
62
+ ### Single-character shot
 
63
 
64
+ ```python
65
+ # Upstream LTX-2.3 distilled pipeline — single reference as first-frame pin
66
+ from ltx_pipelines.distilled import DistilledPipeline
67
+ from ltx_pipelines.utils.args import ImageConditioningInput
68
+ from ltx_core.loader import LoraPathStrengthAndSDOps, sd_ops as _sd_ops_mod
69
+ import torch
70
 
71
+ lora = LoraPathStrengthAndSDOps(
72
+ "LTX2.3-IC-LORA-Dual-Character.safetensors",
73
+ 0.8, # strength (standalone)
74
+ _sd_ops_mod.LTXV_LORA_COMFY_RENAMING_MAP,
75
+ )
76
 
77
+ pipe = DistilledPipeline(
78
+ distilled_checkpoint_path="ltx-2.3-22b-distilled-1.1.safetensors",
79
+ spatial_upsampler_path="ltx-2.3-spatial-upscaler-x2-1.1.safetensors",
80
+ gemma_root="google/gemma-3-12b-it-qat-q4_0-unquantized",
81
+ loras=[lora],
82
+ device=torch.device("cuda:0"),
83
+ )
84
 
85
+ video, audio = pipe(
86
+ prompt="...",
87
+ seed=42,
88
+ height=704, width=1280,
89
+ num_frames=121, # 5 s @ 24 fps, satisfies 8k+1
90
+ frame_rate=24,
91
+ images=[ImageConditioningInput( # first-frame pin = THE reference mechanism
92
+ path="character_ref.png",
93
+ frame_idx=0,
94
+ strength=0.9,
95
+ )],
96
+ enhance_prompt=False,
97
+ )
98
+ ```
99
 
100
+ ### Dual-character shot
101
 
102
+ LTX's i2v pin rejects two pins at the same `frame_idx`, so two refs can't both be pinned at frame 0. Two workable patterns:
103
 
104
+ **Pattern A (recommended): composite reference image.** Build one image with character A on the left and character B on the right (e.g., via PIL `Image.paste` or any image editor), then pin that image at `frame_idx=0`. Both identities transfer in one pin.
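A minimal sketch of the composite approach, assuming Pillow is installed. The helper name `composite_refs`, the 1280 × 704 canvas, and the centre-crop behaviour of `ImageOps.fit` are illustrative choices and not part of the original card.

```python
from PIL import Image, ImageOps


def composite_refs(left, right, size=(1280, 704)):
    """Place two character references side by side on one canvas.

    Each reference is centre-cropped and resized to fill its half,
    so the aspect ratio of the faces is preserved (edges may be cropped).
    """
    width, height = size
    half = width // 2
    canvas = Image.new("RGB", size, (0, 0, 0))
    # Character A fills the left half.
    canvas.paste(ImageOps.fit(left.convert("RGB"), (half, height)), (0, 0))
    # Character B fills the remaining right half.
    canvas.paste(ImageOps.fit(right.convert("RGB"), (width - half, height)), (half, 0))
    return canvas


# Stand-in solid-colour images; in practice load the two portraits with Image.open(...).
ref_a = Image.new("RGB", (400, 600), (255, 0, 0))
ref_b = Image.new("RGB", (300, 300), (0, 0, 255))
composite = composite_refs(ref_a, ref_b)
# composite.save("character_ref.png") would produce the file pinned at frame_idx=0.
```

Centre-cropping is only one option; letterboxing each portrait instead would keep the full image at the cost of empty borders in the pinned frame.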
 
 
105
 
106
+ **Pattern B: stagger the pins.** Pin character A at frame 0, character B at a later latent boundary (e.g., frame 64 — must be a multiple of 8 per the VAE's temporal compression). Only works if B doesn't need to be visible from the very first frame.
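The multiple-of-8 rule for a staggered pin can be checked before a render. This is a hedged sketch: `snap_pin_frame` is a hypothetical helper, and the step of 8 is taken from the note above, not from the pipeline source.

```python
def snap_pin_frame(frame_idx, num_frames, step=8):
    """Round a desired pin position down to the nearest multiple of `step`.

    Raises ValueError if the requested frame lies outside the clip.
    """
    if not 0 <= frame_idx < num_frames:
        raise ValueError(f"frame_idx {frame_idx} is outside 0..{num_frames - 1}")
    return (frame_idx // step) * step


# Character B should appear around frame 60 of a 121-frame clip.
pin_b = snap_pin_frame(60, num_frames=121)
print(pin_b)
```

Rounding down keeps the second pin no later than requested; rounding up would be an equally reasonable choice if character B should enter later.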
 
 
107
 
108
+ ### Recommended parameters
 
 
 
109
 
110
+ | Setting | Value |
111
+ |---|---|
112
+ | Resolution | 1280 × 704 (16:9, native LTX-2.3 distilled training resolution) |
113
+ | Faster preview | 960 × 544 (~40% faster, slightly less detail) |
114
+ | Frames | satisfy 8k+1 — e.g. 121 (5 s), 193 (8 s), 241 (10 s), 361 (15 s) at 24 fps |
115
+ | Strength | Standalone 0.7-0.9 · stacked with style LoRAs 0.3-0.5 |
116
+ | Pin strength | 0.85-0.95 for tight identity, 0.7 for looser "inspired-by" |
117
+ | Trigger word | None |
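The `8k + 1` frame rule in the table can be turned into a small helper. This is an illustrative sketch; `frames_for_duration` and `is_valid_num_frames` are hypothetical names, and rounding up to the next valid count is an assumption about what is usually wanted.

```python
import math


def is_valid_num_frames(num_frames):
    """True if num_frames has the form 8k + 1 for some integer k >= 0."""
    return num_frames > 0 and (num_frames - 1) % 8 == 0


def frames_for_duration(seconds, fps=24):
    """Smallest 8k + 1 frame count covering at least `seconds` of video."""
    raw = round(seconds * fps)
    k = max(1, math.ceil(raw / 8))
    return 8 * k + 1


for duration in (5, 8, 10, 15):
    print(duration, frames_for_duration(duration))
```

At 24 fps this reproduces the values listed in the table (121, 193, 241, 361).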
118
 
119
+ ---
120
 
121
+ ## Field-tested production tips
122
 
123
+ Quirks of this LoRA + the LTX-2.3 distilled backbone that aren't in the original card but matter in practice.
124
 
125
+ ### 1. Repeat color tokens for dark-clothed characters
126
 
127
+ This LoRA has a light-wuxia-robe bias. Dark outfits drift toward white at low pin strength. **Repeat the color token glued to each clothing noun**:
128
 
129
  ```text
130
  BAD: black fedora and black suit
 
132
  ... BLACK suit, BLACK trousers throughout
133
  ```
134
 
135
+ Also bump pin strength to ~0.95 for color fidelity on dark outfits.
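The colour-repetition recipe is mechanical enough to script when prompts are generated programmatically. A minimal sketch, assuming the garments are available as a list; `glue_color` is a hypothetical helper name.

```python
def glue_color(color, garments):
    """Attach the upper-cased colour token to every clothing noun."""
    return ", ".join(f"{color.upper()} {garment}" for garment in garments)


outfit = glue_color("black", ["fedora", "suit jacket", "trousers"])
print(outfit)
```

Garments that should keep a different colour (such as the white shirt in the example above) can be joined in separately.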
136
 
137
+ ### 2. **Never use quoted dialogue in prompts**
138
 
139
+ This LoRA was trained on Chinese drama clips with burned-in Chinese subtitles. **Any quoted dialogue (`「…」` or `"…"`) in the prompt causes the LoRA to hallucinate subtitle characters at the bottom of the frame.** Single biggest gotcha.
140
 
141
  ```text
142
+ BAD: 低声警告 「此茶不可饮!」 ← fake on-screen subtitles
143
  GOOD: 低声急切警告她茶水有毒 ← clean output, indirect narration
144
  ```
145
 
146
+ If your app needs subtitles, burn them post-hoc via `ffmpeg drawtext`.
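Because quoted dialogue is easy to introduce by accident, a prompt check before rendering can help. The sketch below is an assumption-laden example: `find_quoted_dialogue` is a hypothetical helper, and the curly-quote pattern goes beyond the two quote styles named in this tip.

```python
import re

# Corner brackets and straight quotes come from the tip above;
# curly quotes are an added assumption.
_QUOTED_DIALOGUE = re.compile(r'「[^」]*」|“[^”]*”|"[^"]*"')


def find_quoted_dialogue(prompt):
    """Return every quoted span found in the prompt, in order of appearance."""
    return _QUOTED_DIALOGUE.findall(prompt)


bad_prompt = "低声警告 「此茶不可饮」"
good_prompt = "低声急切警告她茶水有毒"
print(find_quoted_dialogue(bad_prompt))
print(find_quoted_dialogue(good_prompt))
```

A non-empty result means the prompt should be rewritten as indirect narration before it is sent to the pipeline.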
147
 
148
+ ### 3. Avoid "object detaches" prompts during action
149
 
150
+ At high motion intensity, the model loses object tracking. A directive like "fedora flies off mid-spin and tumbles to the floor" produces broken output — the hat dematerialises. Either:
151
  - Keep the object attached and say so explicitly ("the fedora STAYS ON his head throughout the spin")
152
  - Or render attach + detach as two clips and concat
153
 
154
+ ### 4. Cross-shot identity drift
155
 
156
+ For multi-shot dialogue scenes, character identity drifts across cuts. Workaround: re-pin the reference image at frame 0 of every shot. (Deterministic seed + same first-frame pin + same prompt scaffolding produces good repeatability.)
157
 
158
  ### Render performance
159
 
160
+ - **Resolution:** 1280 × 704, 121 frames @ 24 fps (~5 s output)
161
+ - **Hardware:** NVIDIA A800 80 GB → ~70 s per shot
 
162
  - **Output:** mp4 with ambient audio track (no TTS)
163
 
164
+ On consumer hardware (RTX 4090 24 GB), expect ~3-4 minutes per shot.
165
 
166
  ---
167
 
168
  ## Limitations
169
 
170
+ 1. **Subtitle hallucination** with quoted dialogue (see tip #2)
 
 
171
  2. **Complex physical interactions** (wrestling, hugging, intricate hand-on-hand) can deform
172
+ 3. **Tail-frame artifact** of LTX-2.3 — last 6-8 frames may smear; trim post-hoc if needed
173
  4. **Action complexity ceiling** — the 8-step distilled budget caps motion complexity at action peaks
174
  5. **Portrait orientation** degrades identity (LoRA trained on landscape only)
175
+ 6. **Dual-character via two separate refs is awkward** (see "How to use" above) — composite-image pin is the cleanest workaround
176
 
177
  ---
178
 
179
  ## Original Chinese README (preserved)
180
 
181
+ The original Chinese model card from ModelScope is reproduced below for users who want the unmodified original documentation. (Note: the original card uses the "IC-LoRA" label — the term has been kept here for fidelity, even though the A/B/C test described above shows the conditioning mechanism is first-frame i2v pinning rather than parallel-canvas IC-LoRA.)
182
 
183
  <details>
184
  <summary>点击展开原版中文模型卡片 (click to expand original Chinese README)</summary>
 
232
 
233
  ---
234
 
235
+ ## Hardware requirements
236
 
237
  | GPU | VRAM | Works? |
238
  |---|---|---|
239
  | A100 / A800 80 GB | 80 GB | ✅ ~70 s per 5 s shot |
240
+ | RTX 4090 / 3090 | 24 GB | ✅ ~3-4 min per 5 s shot |
241
  | RTX 4080 / 4070 Ti Super | 16 GB | ❌ won't fit 22B in bf16 |
242
  | anything < 24 GB | — | ❌ no |
243
 
 
246
  ## Acknowledgements
247
 
248
  - **麻雀 AI (Maque AI)** — original author of this LoRA, [original ModelScope repository](https://www.modelscope.cn/models/fxj1131/LTX2.3-IC-LORA-Dual-Character)
249
+ - **[Lightricks](https://www.lightricks.com/)** — for the LTX-Video 2.3 base model
250
+ - **ZKong**: for catching the IC-LoRA labeling discrepancy in the discussion thread; the empirical A/B/C test run in response settled it
251
 
252
  ---
253
 
254
  ## Source attribution
255
 
256
+ > This is an English-language mirror of [fxj1131's LTX2.3 Dual-Character LoRA on ModelScope](https://www.modelscope.cn/models/fxj1131/LTX2.3-IC-LORA-Dual-Character).
257
  > All credit for the model weights belongs to the original author, **麻雀 AI (Maque AI)**.
258
  > This mirror exists to make the model + documentation accessible to HuggingFace users who cannot easily access ModelScope, and to share field-tested usage notes from a production deployment.
259
  > **The `.safetensors` weights file is unmodified and byte-identical to the ModelScope upload.**