SyFeee/LTX2.3-Dual-Character-en

@@ -5,21 +5,24 @@ base_model:
 tags:
   - video-generation
   - lora
-  - ic-lora
   - ltx-video
   - dual-character
   - dialogue
   - cinematic
   - chinese-drama
 pipeline_tag: image-to-video
 language:
   - en
   - zh
 ---
-# LTX-Video 2.3 IC-LoRA: Dual-Character (English mirror)
-An English-mirrored, field-tested **In-Context LoRA** for `Lightricks/LTX-2.3` (22B distilled), tuned for two-character dialogue scenes and multi-shot cinematic video generation.
 ---
@@ -43,80 +46,85 @@ Episode is an 8-shot Chinese palace drama (《玉佩定情》 + 《暗夜阴谋
 ## What this LoRA does
-An **In-Context LoRA (IC-LoRA)** trained on top of `Lightricks/LTX-2.3` (22B distilled), specifically tuned for:
 1. **Two-character dialogue scenes** — significantly reduces character drift when two people appear in the same frame
 2. **Cinematic shot composition** — reinforced for dialogue-driven framing (close-up ↔ medium ↔ wide)
 3. **Multi-shot narrative continuity** — better understanding of multi-segment prompts (storyboard-style descriptions)
 4. **Style compatibility** — works well across 古风仙侠 (ancient Chinese fantasy), 现代都市 (modern urban), and 3D 动漫 styles
-This is an **IC-LoRA** (in-context LoRA), so it expects reference images to be passed through the parallel-canvas conditioning mechanism, NOT as pixel-pinned frames. See the [Lightricks `ltx_pipelines.ic_lora.ICLoraPipeline`](https://github.com/Lightricks/LTX-2) for the upstream pipeline.
----
-## Model card
-| Field | Value |
-|---|---|
-| Base model | `Lightricks/LTX-2.3` (22B distilled) |
-| LoRA type | IC-LoRA (video-to-video conditioning) |
-| File | `LTX2.3-IC-LORA-Dual-Character.safetensors` (~313 MB) |
-| License | Apache 2.0 |
-| Trigger word | None — no special token required |
 ---
-## Field-tested production usage
-The notes below are from running this LoRA in production as part of a multi-shot Chinese drama video generation pipeline. They go beyond what's in the original model card.
-### Strength
-- **Standalone:** 0.7–0.9 works well
-- **When stacking with other LoRAs:** drop to 0.3–0.5 to stay under the typical 1.5 over-baking ceiling
-### Resolution
-- **Recommended:** 1280×704 (16:9, native LTX-2.3 distilled training resolution)
-- **Faster preview:** 960×544 (~40% faster, slightly less detail)
-- **Avoid portrait (9:16)** — this LoRA was trained on landscape; identity quality degrades noticeably in portrait orientation
-- Width and height must each be divisible by 64
-### Number of frames
-LTX-2.3 requires `num_frames` to satisfy `8k + 1` (e.g., 121, 145, 193, 241, 361). At 24 fps:
-- 5 s shot = 121 frames
-- 8 s shot = 193 frames
-- 15 s shot = 361 frames
-### Prompt structure that works well
-Use a 3-block structure: `[场景] / [角色] / [镜头与情节]`.
-```text
-[场景] 古风皇宫御花园桃花径，午后金色阳光透过盛开桃花斜射，
-粉色花瓣随风飘落，朱红宫墙翠竹环绕。
-[角色] 沈月华：年轻女子，长黑发半扎绿玉簪，鬓边一朵小白花，
-柔和圆润大眼，肤色白皙。身穿浅蓝色丝绸汉服宫装，白色云鹤刺绣。
-萧云霄：年轻男子，黑发束起金冠玉饰，剑眉星目。身穿深红色丝绸金线龙纹宫袍。
-[镜头与情节] 中景两人画面。沈月华手中持翠绿玉佩，萧云霄从右侧朱红
-宫墙转角缓步走出停下，拱手轻施一礼，目光温和注视沈月华手中玉佩。
-电影级布光，浅景深虚化，35mm 双人中景，温暖色调。
-```
-### Production tips earned the hard way
-These are quirks of this LoRA + the LTX-2.3 distilled backbone that aren't documented in the original model card but matter in practice:
-#### 1. Static-image reference: use a SHORT video wrap (≤ 8 frames)
-If you wrap a single PNG character ref into a video for IC-LoRA conditioning, **use 8 frames (≈ 0.33 s) — NOT 30 frames**. Longer static wraps cause a "first second stuck on the ref image" beat at the start of every clip. The IC-LoRA's per-frame matching dominates motion onset when the static wrap is too long.
-#### 2. Repeat color tokens for dark-clothed characters
-This LoRA has a light-wuxia-robe bias. Dark outfits drift toward white at low ref-image-strength. Recipe: **repeat the color token glued to each clothing noun**:
 ```text
 BAD:  black fedora and black suit
@@ -124,55 +132,53 @@ GOOD: BLACK fedora, white shirt, BLACK suit jacket, BLACK trousers,
       ... BLACK suit, BLACK trousers throughout
 ```
-Also bump `ref_image_strength` to 0.55 (action) or 0.85 (medium-slow) for color fidelity.
-#### 3. **Never use quoted dialogue in prompts**
-This LoRA was trained on Chinese drama clips with burned-in Chinese subtitles. **Any quoted dialogue (`「…」` or `"…"`) in the prompt causes the LoRA to hallucinate subtitle characters at the bottom of the frame.** This is the single biggest gotcha.
 ```text
-BAD:  低声警告 「此茶不可饮！」    ← produces fake on-screen subtitles
 GOOD: 低声急切警告她茶水有毒        ← clean output, indirect narration
 ```
-If your application needs actual subtitles, burn them post-hoc via `ffmpeg drawtext`, not via the prompt.
-#### 4. Avoid "object detaches" prompts during action
-At high motion intensity (cas/ris ≤ 0.40), the model loses object tracking. A directive like "fedora flies off mid-spin and tumbles to the floor" produces broken output — the hat dematerialises. Either:
 - Keep the object attached and say so explicitly ("the fedora STAYS ON his head throughout the spin")
 - Or render attach + detach as two clips and concat
-#### 5. Cross-shot identity drift
-For multi-shot dialogue scenes, character identity drifts across cuts. Workaround: chain shots by passing a **12-frame tail clip** of shot N as `ref_videos=[(tail.mp4, 0.7)]` for shot N+1. Significantly improves continuity.
 ### Render performance
-- **Resolution:** 1280×704, 121 frames @ 24 fps (~5 s output)
-- **Hardware:** NVIDIA A800 80 GB
-- **Time:** ~70 s per shot (8-step distilled + 3-step spatial upscaler + audio decode)
 - **Output:** mp4 with ambient audio track (no TTS)
-On consumer hardware (RTX 4090 24 GB), expect ~3–4 minutes per shot due to memory pressure from the 22B model.
 ---
 ## Limitations
-From the original author + our field testing:
-1. **Subtitle hallucination** with quoted dialogue (see tip #3 above)
 2. **Complex physical interactions** (wrestling, hugging, intricate hand-on-hand) can deform
-3. **Tail-frame artifact** of LTX-2.3 — last 6–8 frames may smear; trim post-hoc if needed
 4. **Action complexity ceiling** — the 8-step distilled budget caps motion complexity at action peaks
 5. **Portrait orientation** degrades identity (LoRA trained on landscape only)
 ---
 ## Original Chinese README (preserved)
-The original Chinese model card from ModelScope is reproduced below for users who want the unmodified original documentation.
 <details>
 <summary>点击展开原版中文模型卡片 (click to expand original Chinese README)</summary>
@@ -226,49 +232,12 @@ The original Chinese model card from ModelScope is reproduced below for users wh
 ---
-## How to use
-### With the upstream Lightricks pipeline
-```python
-from ltx_pipelines.ic_lora import ICLoraPipeline
-from ltx_core.loader import LoraPathStrengthAndSDOps
-from ltx_core.loader import sd_ops as _sd_ops_mod
-import torch
-# Use the IC-LoRA's standard SDOps mapping
-lora = LoraPathStrengthAndSDOps(
-    "LTX2.3-IC-LORA-Dual-Character.safetensors",
-    0.8,                                      # strength (standalone)
-    _sd_ops_mod.LTXV_LORA_COMFY_RENAMING_MAP,
-)
-pipe = ICLoraPipeline(
-    distilled_checkpoint_path="ltx-2.3-22b-distilled-1.1.safetensors",
-    spatial_upsampler_path="ltx-2.3-spatial-upscaler-x2-1.1.safetensors",
-    gemma_root="google/gemma-3-12b-it-qat-q4_0-unquantized",
-    loras=[lora],
-    device=torch.device("cuda:0"),
-)
-video, audio = pipe(
-    prompt="...",                # your structured 3-block prompt
-    seed=42,
-    height=704, width=1280,
-    num_frames=121,              # 5 s @ 24 fps, satisfies 8k+1
-    frame_rate=24,
-    video_conditioning=[("char_ref.mp4", 0.85)],   # 8-frame static wrap of the character portrait
-    enhance_prompt=False,
-    conditioning_attention_strength=0.85,
-)
-```
-### Hardware requirements
 | GPU | VRAM | Works? |
 |---|---|---|
 | A100 / A800 80 GB | 80 GB | ✅ ~70 s per 5 s shot |
-| RTX 4090 / 3090 | 24 GB | ✅ ~3–4 min per 5 s shot |
 | RTX 4080 / 4070 Ti Super | 16 GB | ❌ won't fit 22B in bf16 |
 | anything < 24 GB | — | ❌ no |
@@ -277,13 +246,14 @@ video, audio = pipe(
 ## Acknowledgements
 - **麻雀 AI (Maque AI)** — original author of this LoRA, [original ModelScope repository](https://www.modelscope.cn/models/fxj1131/LTX2.3-IC-LORA-Dual-Character)
-- **[Lightricks](https://www.lightricks.com/)** — for the LTX-Video 2.3 base model and the IC-LoRA framework
 ---
 ## Source attribution
-> ⚠️ **This is an English-language mirror of [fxj1131's LTX2.3 IC-LoRA Dual-Character on ModelScope](https://www.modelscope.cn/models/fxj1131/LTX2.3-IC-LORA-Dual-Character).**
 > All credit for the model weights belongs to the original author, **麻雀 AI (Maque AI)**.
 > This mirror exists to make the model + documentation accessible to HuggingFace users who cannot easily access ModelScope, and to share field-tested usage notes from a production deployment.
 > **The `.safetensors` weights file is unmodified and byte-identical to the ModelScope upload.**

 tags:
   - video-generation
   - lora
   - ltx-video
   - dual-character
   - dialogue
   - cinematic
   - chinese-drama
+  - image-to-video
 pipeline_tag: image-to-video
 language:
   - en
   - zh
 ---
+# LTX-Video 2.3 — Dual-Character LoRA (English mirror)
+A field-tested **image-to-video character-consistency LoRA** for `Lightricks/LTX-2.3` (22B distilled), tuned for two-character dialogue scenes and multi-shot cinematic video generation.
+> ⚠️ **Naming note (corrected 2026-05-21):**
+> The original filename and ModelScope repo include the string "IC-LORA", but **this is NOT an IC-LoRA** in the strict technical sense (parallel-canvas / `video_conditioning` mechanism). An A/B/C test (same prompt + seed, three reference-channel variants) confirmed that the LoRA's actual conditioning mechanism is **first-frame pixel pinning** (the regular i2v path), not parallel-canvas attention. Earlier copy on this card incorrectly described it as IC-LoRA — that has been removed. Credit to ZKong for raising the discrepancy in the discussions tab.
 ---
 ## What this LoRA does
+Fine-tuned on `Lightricks/LTX-2.3` (22B distilled), specifically for:
 1. **Two-character dialogue scenes** — significantly reduces character drift when two people appear in the same frame
 2. **Cinematic shot composition** — reinforced for dialogue-driven framing (close-up ↔ medium ↔ wide)
 3. **Multi-shot narrative continuity** — better understanding of multi-segment prompts (storyboard-style descriptions)
 4. **Style compatibility** — works well across 古风仙侠 (ancient Chinese fantasy), 现代都市 (modern urban), and 3D 动漫 styles
+The reference image is consumed via **first-frame pixel pin** (standard i2v conditioning), not via the parallel-canvas / `video_conditioning` channel.
 ---
+## How to use (correct pattern)
+### Single-character shot
+```python
+# Upstream LTX-2.3 distilled pipeline — single reference as first-frame pin
+from ltx_pipelines.distilled import DistilledPipeline
+from ltx_pipelines.utils.args import ImageConditioningInput
+from ltx_core.loader import LoraPathStrengthAndSDOps, sd_ops as _sd_ops_mod
+import torch
+lora = LoraPathStrengthAndSDOps(
+    "LTX2.3-IC-LORA-Dual-Character.safetensors",
+    0.8,                                       # strength (standalone)
+    _sd_ops_mod.LTXV_LORA_COMFY_RENAMING_MAP,
+)
+pipe = DistilledPipeline(
+    distilled_checkpoint_path="ltx-2.3-22b-distilled-1.1.safetensors",
+    spatial_upsampler_path="ltx-2.3-spatial-upscaler-x2-1.1.safetensors",
+    gemma_root="google/gemma-3-12b-it-qat-q4_0-unquantized",
+    loras=[lora],
+    device=torch.device("cuda:0"),
+)
+video, audio = pipe(
+    prompt="...",
+    seed=42,
+    height=704, width=1280,
+    num_frames=121,                              # 5 s @ 24 fps, satisfies 8k+1
+    frame_rate=24,
+    images=[ImageConditioningInput(             # first-frame pin = THE reference mechanism
+        path="character_ref.png",
+        frame_idx=0,
+        strength=0.9,
+    )],
+    enhance_prompt=False,
+)
+```
+### Dual-character shot
+LTX's i2v conditioning rejects two pins at the same `frame_idx`, so two reference images can't both be pinned at frame 0. Two patterns work:
+**Pattern A (recommended): composite reference image.** Build one image with character A on the left and character B on the right (e.g., via PIL `Image.paste` or any image editor), pin THAT at `frame_idx=0`. Both identities transfer in one pin.
+**Pattern B: stagger the pins.** Pin character A at frame 0, character B at a later latent boundary (e.g., frame 64 — must be a multiple of 8 per the VAE's temporal compression). Only works if B doesn't need to be visible from the very first frame.
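Pattern A can be sketched with Pillow. This is a minimal sketch under stated assumptions, not code from the original card: `half_boxes`, `fit_inside`, and `build_composite` are hypothetical helper names, the black background and centred placement are arbitrary choices, and the default canvas size simply reuses the recommended 1280 × 704 resolution.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def half_boxes(width: int = 1280, height: int = 704) -> Tuple[Box, Box]:
    """Split the canvas into a left half (character A) and a right half (character B)."""
    mid = width // 2
    return (0, 0, mid, height), (mid, 0, width, height)


def fit_inside(src_w: int, src_h: int, box: Box) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Scale a source image to fit inside box while keeping its aspect ratio.

    Returns ((new_w, new_h), (x, y)), where (x, y) centres the image in the box.
    Integer arithmetic avoids float rounding surprises.
    """
    left, top, right, bottom = box
    box_w, box_h = right - left, bottom - top
    if src_w * box_h <= src_h * box_w:      # height is the limiting side
        new_w, new_h = max(1, src_w * box_h // src_h), box_h
    else:                                   # width is the limiting side
        new_w, new_h = box_w, max(1, src_h * box_w // src_w)
    offset = (left + (box_w - new_w) // 2, top + (box_h - new_h) // 2)
    return (new_w, new_h), offset


def build_composite(path_a: str, path_b: str, out_path: str,
                    width: int = 1280, height: int = 704) -> None:
    """Paste character A on the left and character B on the right of one canvas."""
    from PIL import Image  # Pillow, imported lazily so the helpers above need no dependencies

    canvas = Image.new("RGB", (width, height), (0, 0, 0))  # assumption: black background
    for path, box in zip((path_a, path_b), half_boxes(width, height)):
        with Image.open(path) as im:
            size, offset = fit_inside(im.width, im.height, box)
            canvas.paste(im.convert("RGB").resize(size), offset)
    canvas.save(out_path)
```

The saved file would then be the single `path` pinned at `frame_idx=0`, in the same way as the single-character example.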
+### Recommended parameters
+| Setting | Value |
+|---|---|
+| Resolution | 1280 × 704 (16:9, native LTX-2.3 distilled training resolution) |
+| Faster preview | 960 × 544 (~40% faster, slightly less detail) |
+| Frames | satisfy 8k+1 — e.g. 121 (5 s), 193 (8 s), 241 (10 s), 361 (15 s) at 24 fps |
+| Strength | Standalone 0.7-0.9 · stacked with style LoRAs 0.3-0.5 |
+| Pin strength | 0.85-0.95 for tight identity, 0.7 for looser "inspired-by" |
+| Trigger word | None |
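The `8k+1` frame rule in the table is easy to get wrong by one frame, so a small helper can derive `num_frames` from a target duration. This is an illustrative sketch: `frames_for_duration` and `is_valid_num_frames` are hypothetical names, and rounding to the nearest valid count (with `k >= 1`) is an assumption rather than documented pipeline behaviour.

```python
def frames_for_duration(seconds: float, fps: int = 24) -> int:
    """Return the 8k+1 frame count closest to seconds * fps (assumes k >= 1)."""
    k = max(1, round(seconds * fps / 8))
    return 8 * k + 1


def is_valid_num_frames(num_frames: int) -> bool:
    """True when num_frames has the form 8k+1 with k >= 1."""
    return num_frames >= 9 and num_frames % 8 == 1


for seconds in (5, 8, 10, 15):
    print(seconds, "s ->", frames_for_duration(seconds), "frames")
```

At 24 fps this reproduces the values listed in the table (121, 193, 241, 361).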
+---
+## Field-tested production tips
+Quirks of this LoRA + the LTX-2.3 distilled backbone that aren't in the original card but matter in practice.
+### 1. Repeat color tokens for dark-clothed characters
+This LoRA is biased toward light wuxia robes, so dark outfits drift toward white at low pin strength. **Repeat the color token immediately before each clothing noun**:
 ```text
 BAD:  black fedora and black suit
 GOOD: BLACK fedora, white shirt, BLACK suit jacket, BLACK trousers,
       ... BLACK suit, BLACK trousers throughout
 ```
+Also bump pin strength to ~0.95 for color fidelity on dark outfits.
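When prompts are assembled programmatically, the color repetition can be automated. The helper below is a minimal sketch; `color_locked` is a hypothetical name, and upper-casing the token mirrors the example above rather than any documented requirement.

```python
from typing import Iterable


def color_locked(color: str, garments: Iterable[str]) -> str:
    """Prefix every garment noun with the same upper-cased color token."""
    token = color.strip().upper()
    return ", ".join(f"{token} {garment.strip()}" for garment in garments)


print(color_locked("black", ["fedora", "suit jacket", "trousers"]))
# BLACK fedora, BLACK suit jacket, BLACK trousers
```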
+### 2. **Never use quoted dialogue in prompts**
+This LoRA was trained on Chinese drama clips with burned-in Chinese subtitles. **Any quoted dialogue (`「…」` or `"…"`) in the prompt causes the LoRA to hallucinate subtitle characters at the bottom of the frame.** Single biggest gotcha.
 ```text
+BAD:  低声警告 「此茶不可饮！」    ← fake on-screen subtitles
 GOOD: 低声急切警告她茶水有毒        ← clean output, indirect narration
 ```
+If your app needs subtitles, burn them post-hoc via `ffmpeg drawtext`.
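A cheap guard is to lint prompts for quoted dialogue before they reach the pipeline. The check below is a sketch, not part of the original card: `find_quoted_dialogue` is a hypothetical helper, and the curly-quote and `『…』` patterns are extra assumptions beyond the `「…」` and `"…"` pairs named above.

```python
import re
from typing import List

# 「…」 and "…" come from the tip above; 『…』 and curly quotes are assumed extras.
_QUOTED = re.compile(r'「[^」]*」|『[^』]*』|"[^"]*"|“[^”]*”')


def find_quoted_dialogue(prompt: str) -> List[str]:
    """Return every quoted span found in the prompt (an empty list means clean)."""
    return _QUOTED.findall(prompt)


for prompt in ("低声警告 「此茶不可饮！」", "低声急切警告她茶水有毒"):
    hits = find_quoted_dialogue(prompt)
    print("REJECT" if hits else "OK", prompt, hits)
```

Rejecting the prompt, rather than silently stripping the quotes, keeps the rewrite into indirect narration a deliberate step.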
+### 3. Avoid "object detaches" prompts during action
+At high motion intensity, the model loses object tracking. A directive like "fedora flies off mid-spin and tumbles to the floor" produces broken output — the hat dematerialises. Either:
 - Keep the object attached and say so explicitly ("the fedora STAYS ON his head throughout the spin")
 - Or render attach + detach as two clips and concat
+### 4. Cross-shot identity drift
+For multi-shot dialogue scenes, character identity drifts across cuts. Workaround: re-pin the reference image at frame 0 of every shot. (A deterministic seed, the same first-frame pin, and the same prompt scaffolding together give good repeatability.)
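One way to keep the seed, pin, and scaffolding fixed across shots is to plan every shot from a single template. The sketch below only builds the per-shot arguments and does not call the pipeline; `ShotPlan` and `plan_shots` are hypothetical names, and the default values are taken from the usage example rather than from any tuning.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ShotPlan:
    prompt: str          # shared scaffold + per-shot action
    ref_image: str       # pinned at frame 0 of this shot
    pin_strength: float
    seed: int
    num_frames: int


def plan_shots(scaffold: str, actions: Sequence[str], ref_image: str,
               seed: int = 42, pin_strength: float = 0.9,
               num_frames: int = 121) -> List[ShotPlan]:
    """Build one plan per shot; only the action text varies between shots."""
    return [
        ShotPlan(
            prompt=f"{scaffold} {action}".strip(),
            ref_image=ref_image,
            pin_strength=pin_strength,
            seed=seed,
            num_frames=num_frames,
        )
        for action in actions
    ]


plans = plan_shots("[scene] palace garden.", ["A bows.", "B smiles."], "pair_ref.png")
for plan in plans:
    print(plan)
```

Each plan maps onto one pipeline call, with `ref_image` and `pin_strength` supplying the `path` and `strength` of the frame-0 pin.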
 ### Render performance
+- **Resolution:** 1280 × 704, 121 frames @ 24 fps (~5 s output)
+- **Hardware:** NVIDIA A800 80 GB → ~70 s per shot
 - **Output:** mp4 with ambient audio track (no TTS)
+On consumer hardware (RTX 4090 24 GB), expect ~3-4 minutes per shot.
 ---
 ## Limitations
+1. **Subtitle hallucination** with quoted dialogue (see tip #2)
 2. **Complex physical interactions** (wrestling, hugging, intricate hand-on-hand) can deform
+3. **Tail-frame artifact** of LTX-2.3 — last 6-8 frames may smear; trim post-hoc if needed
 4. **Action complexity ceiling** — the 8-step distilled budget caps motion complexity at action peaks
 5. **Portrait orientation** degrades identity (LoRA trained on landscape only)
+6. **Dual-character via two separate refs is awkward** (see "How to use" above) — composite-image pin is the cleanest workaround
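For the tail-frame artifact (item 3), the trim can be scripted. The helper below is a sketch that only builds an `ffmpeg` argument list using the standard `-t` duration option; `trim_command` is a hypothetical name, dropping 8 frames is the upper end of the range quoted above, and re-encoding with ffmpeg's defaults is an assumption about what is acceptable for the output.

```python
from typing import List


def trim_command(src: str, dst: str, num_frames: int = 121,
                 fps: int = 24, drop_tail: int = 8) -> List[str]:
    """Build an ffmpeg command that keeps all but the last drop_tail frames."""
    keep = num_frames - drop_tail
    if keep <= 0:
        raise ValueError("drop_tail must be smaller than num_frames")
    duration = keep / fps  # seconds of video to keep
    return ["ffmpeg", "-y", "-i", src, "-t", f"{duration:.3f}", dst]


print(trim_command("shot.mp4", "shot_trimmed.mp4"))
```

The list can be passed to `subprocess.run(cmd, check=True)`; because `-t` limits the whole output, the audio track is shortened to match.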
 ---
 ## Original Chinese README (preserved)
+The original Chinese model card from ModelScope is reproduced below for users who want the unmodified original documentation. (Note: the original card uses the "IC-LoRA" label — the term has been kept here for fidelity, even though the A/B/C test described above shows the conditioning mechanism is first-frame i2v pinning rather than parallel-canvas IC-LoRA.)
 <details>
 <summary>点击展开原版中文模型卡片 (click to expand original Chinese README)</summary>
 ---
+## Hardware requirements
 | GPU | VRAM | Works? |
 |---|---|---|
 | A100 / A800 80 GB | 80 GB | ✅ ~70 s per 5 s shot |
+| RTX 4090 / 3090 | 24 GB | ✅ ~3-4 min per 5 s shot |
 | RTX 4080 / 4070 Ti Super | 16 GB | ❌ won't fit 22B in bf16 |
 | anything < 24 GB | — | ❌ no |
 ## Acknowledgements
 - **麻雀 AI (Maque AI)** — original author of this LoRA, [original ModelScope repository](https://www.modelscope.cn/models/fxj1131/LTX2.3-IC-LORA-Dual-Character)
+- **[Lightricks](https://www.lightricks.com/)** — for the LTX-Video 2.3 base model
+- **ZKong** — for catching the IC-LoRA labeling discrepancy in the discussion thread; the empirical A/B/C test run in response settled it
 ---
 ## Source attribution
+> This is an English-language mirror of [fxj1131's LTX2.3 Dual-Character LoRA on ModelScope](https://www.modelscope.cn/models/fxj1131/LTX2.3-IC-LORA-Dual-Character).
 > All credit for the model weights belongs to the original author, **麻雀 AI (Maque AI)**.
 > This mirror exists to make the model + documentation accessible to HuggingFace users who cannot easily access ModelScope, and to share field-tested usage notes from a production deployment.
 > **The `.safetensors` weights file is unmodified and byte-identical to the ModelScope upload.**