---
license: apache-2.0
base_model:
- Lightricks/LTX-2.3
tags:
- video-generation
- lora
- ltx-video
- dual-character
- dialogue
- cinematic
- chinese-drama
- image-to-video
pipeline_tag: image-to-video
language:
- en
- zh
---

# LTX-Video 2.3: Dual-Character LoRA (English mirror)

A field-tested **image-to-video character-consistency LoRA** for `Lightricks/LTX-2.3` (22B distilled), tuned for two-character dialogue scenes and multi-shot cinematic video generation.

> ⚠️ **Naming note (corrected 2026-05-21):**
> The original filename and ModelScope repo include the string "IC-LORA", but **this is NOT an IC-LoRA** in the strict technical sense (the parallel-canvas / `video_conditioning` mechanism). An A/B/C test (same prompt and seed, three reference-channel variants) confirmed that the LoRA's actual conditioning mechanism is **first-frame pixel pinning** (the regular i2v path), not parallel-canvas attention. Earlier copy on this card incorrectly described it as an IC-LoRA; that description has been removed. Credit to ZKong for raising the discrepancy in the discussions tab.

---

## Example renders

The example episode is an 8-shot Chinese palace drama (《玉佩定情》 + 《暗夜阴谋》) with three characters: 沈月华 (Shen Yuehua, heroine), 萧云霄 (Xiao Yunxiao, prince), and 慕容静 (Murong Jing, antagonist). Render config: 1280×704, 121 frames @ 24 fps, ambient audio.

### Single-character identity: Shen Yuehua walks in the garden and picks up a jade pendant
<video controls autoplay muted loop src="https://huggingface.co/SyFeee/LTX2.3-Dual-Character-en/resolve/main/examples/E1S1_garden_walk_single_character.mp4"></video>

### Dual-character dialogue: Shen and Xiao meet (the LoRA's signature use case)
<video controls autoplay muted loop src="https://huggingface.co/SyFeee/LTX2.3-Dual-Character-en/resolve/main/examples/E1S2_prince_meets_dual_character.mp4"></video>

### Cross-scene identity: Murong Jing in a different location (palace night chamber)
<video controls autoplay muted loop src="https://huggingface.co/SyFeee/LTX2.3-Dual-Character-en/resolve/main/examples/E2S1_murong_plots_cross_scene.mp4"></video>

### Three-character composition: the LoRA's upper limit
<video controls autoplay muted loop src="https://huggingface.co/SyFeee/LTX2.3-Dual-Character-en/resolve/main/examples/E2S4_three_character_confrontation.mp4"></video>

---

## What this LoRA does

Fine-tuned on `Lightricks/LTX-2.3` (22B distilled), specifically for:

1. **Two-character dialogue scenes**: significantly reduces character drift when two people appear in the same frame
2. **Cinematic shot composition**: reinforced for dialogue-driven framing (close-up ↔ medium ↔ wide)
3. **Multi-shot narrative continuity**: better understanding of multi-segment prompts (storyboard-style descriptions)
4. **Style compatibility**: works well across 古风仙侠 (ancient Chinese fantasy), 现代都市 (modern urban), and 3D 动漫 (3D anime) styles

The reference image is consumed via a **first-frame pixel pin** (standard i2v conditioning), not via the parallel-canvas / `video_conditioning` channel.

---

## How to use (correct pattern)

### Single-character shot

```python
# Upstream LTX-2.3 distilled pipeline: single reference as first-frame pin
from ltx_pipelines.distilled import DistilledPipeline
from ltx_pipelines.utils.args import ImageConditioningInput
from ltx_core.loader import LoraPathStrengthAndSDOps, sd_ops as _sd_ops_mod
import torch

lora = LoraPathStrengthAndSDOps(
    "LTX2.3-IC-LORA-Dual-Character.safetensors",
    0.8,  # strength (standalone)
    _sd_ops_mod.LTXV_LORA_COMFY_RENAMING_MAP,
)

pipe = DistilledPipeline(
    distilled_checkpoint_path="ltx-2.3-22b-distilled-1.1.safetensors",
    spatial_upsampler_path="ltx-2.3-spatial-upscaler-x2-1.1.safetensors",
    gemma_root="google/gemma-3-12b-it-qat-q4_0-unquantized",
    loras=[lora],
    device=torch.device("cuda:0"),
)

video, audio = pipe(
    prompt="...",
    seed=42,
    height=704, width=1280,
    num_frames=121,  # 5 s @ 24 fps, satisfies 8k+1
    frame_rate=24,
    images=[ImageConditioningInput(  # first-frame pin = THE reference mechanism
        path="character_ref.png",
        frame_idx=0,
        strength=0.9,
    )],
    enhance_prompt=False,
)
```

### Dual-character shot

LTX's i2v conditioning rejects two pins at the same `frame_idx`, so two references can't both be pinned at frame 0. Two workable patterns:

**Pattern A (recommended): composite reference image.** Build one image with character A on the left and character B on the right (e.g., via PIL `Image.paste` or any image editor) and pin that image at `frame_idx=0`. Both identities transfer in one pin.

**Pattern B: stagger the pins.** Pin character A at frame 0 and character B at a later latent boundary (e.g., frame 64; the index must be a multiple of 8 because of the VAE's temporal compression). This only works if B doesn't need to be visible from the very first frame.
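
A minimal sketch of Pattern A, assuming Pillow is installed and that a plain left/right split of the canvas is acceptable. The function names `side_by_side_layout` and `build_composite`, the black background, and the center-crop behaviour are illustrative choices, not something this card or the original author specifies.

```python
def side_by_side_layout(canvas_w, canvas_h, n_refs=2):
    """Return one (left, top, right, bottom) slot per reference, left to right."""
    slot_w = canvas_w // n_refs
    return [(i * slot_w, 0, (i + 1) * slot_w, canvas_h) for i in range(n_refs)]


def build_composite(ref_paths, out_path, canvas_size=(1280, 704)):
    """Paste each reference image into its slot and save the composite."""
    # Pillow is imported lazily so the layout helper above has no dependency.
    from PIL import Image, ImageOps

    canvas = Image.new("RGB", canvas_size, (0, 0, 0))
    slots = side_by_side_layout(canvas_size[0], canvas_size[1], len(ref_paths))
    for path, (left, top, right, bottom) in zip(ref_paths, slots):
        ref = Image.open(path).convert("RGB")
        # Center-crop and resize so the reference fills its slot undistorted.
        ref = ImageOps.fit(ref, (right - left, bottom - top))
        canvas.paste(ref, (left, top))
    canvas.save(out_path)
    return out_path


# Hypothetical file names:
# build_composite(["char_a.png", "char_b.png"], "dual_ref.png")
```

The saved composite is then pinned at `frame_idx=0` in the same way as `character_ref.png` in the single-character example. Whether a hard split or a hand-composed scene gives better identity transfer is not something this card states, so treat the layout as a starting point.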

### Recommended parameters

| Setting | Value |
|---|---|
| Resolution | 1280 × 704 (approximately 16:9, native LTX-2.3 distilled training resolution) |
| Faster preview | 960 × 544 (~40% faster, slightly less detail) |
| Frames | must satisfy 8k+1, e.g. 121 (5 s), 193 (8 s), 241 (10 s), 361 (15 s) at 24 fps |
| LoRA strength | Standalone 0.7-0.9; stacked with style LoRAs 0.3-0.5 |
| Pin strength | 0.85-0.95 for tight identity, 0.7 for a looser "inspired-by" look |
| Trigger word | None |
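
The 8k+1 frame rule and the multiple-of-8 pin rule are easy to get wrong by one frame, so a small pre-flight check can help. The helpers below are a sketch with hypothetical names; the bounds check in `is_valid_pin_index` (the pin must fall inside the clip) is an assumption, not something the pipeline is documented to enforce.

```python
import math


def is_valid_frame_count(num_frames):
    """True when num_frames has the form 8k + 1 (k >= 0)."""
    return num_frames >= 1 and (num_frames - 1) % 8 == 0


def frames_for_duration(seconds, fps=24):
    """Smallest 8k + 1 frame count that covers at least `seconds` of video."""
    needed = math.ceil(seconds * fps)
    k = max(0, math.ceil((needed - 1) / 8))
    return 8 * k + 1


def is_valid_pin_index(frame_idx, num_frames):
    """True for frame 0 or a later multiple of 8 that lies inside the clip."""
    return 0 <= frame_idx < num_frames and frame_idx % 8 == 0


print([frames_for_duration(s) for s in (5, 8, 10, 15)])  # [121, 193, 241, 361]
print(is_valid_pin_index(64, 121))                       # True
```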

---

## Field-tested production tips

These are quirks of this LoRA and the LTX-2.3 distilled backbone that aren't in the original card but matter in practice.

### 1. Repeat color tokens for dark-clothed characters

This LoRA is biased toward light-colored wuxia robes, so dark outfits drift toward white at low pin strength. **Repeat the color token directly in front of each clothing noun**:

```text
BAD:  black fedora and black suit
GOOD: BLACK fedora, white shirt, BLACK suit jacket, BLACK trousers,
      ... BLACK suit, BLACK trousers throughout
```

Also raise pin strength to ~0.95 for color fidelity on dark outfits.
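
If prompts are assembled in code, the repetition can be generated rather than typed by hand. This is a small sketch with a hypothetical helper name; upper-casing the token simply mirrors the GOOD example above, and this card does not say whether capitalisation itself matters.

```python
def color_locked_outfit(color, garments, extras=()):
    """Prefix every garment with the upper-cased color token."""
    token = color.upper()
    locked = [f"{token} {garment}" for garment in garments]
    return ", ".join(locked + list(extras))


print(color_locked_outfit("black", ["fedora", "suit jacket", "trousers"],
                          extras=["white shirt"]))
# BLACK fedora, BLACK suit jacket, BLACK trousers, white shirt
```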

### 2. Never use quoted dialogue in prompts

This LoRA was trained on Chinese drama clips with burned-in Chinese subtitles. **Any quoted dialogue (`「…」` or `"…"`) in the prompt causes the LoRA to hallucinate subtitle characters at the bottom of the frame.** This is the single biggest gotcha.

```text
BAD:  低声警告 「此茶不可饮!」        ← fake on-screen subtitles
GOOD: 低声急切警告她茶水有毒          ← clean output, indirect narration
```

In English, the BAD prompt reads roughly "warns in a low voice: 'This tea must not be drunk!'", while the GOOD prompt reads "urgently warns her in a low voice that the tea is poisoned".

If your app needs subtitles, burn them in afterwards with the `ffmpeg` `drawtext` filter.
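
Because a single quoted line is enough to trigger the artifact, it can be worth linting prompts before rendering. The check below is a sketch: it covers the two quote styles named above plus curly double quotes, which is an added assumption, and the function name is hypothetical.

```python
import re

# Quote pairs treated as "spoken dialogue" markers in this sketch.
_QUOTED = re.compile("|".join([
    r"「[^」]*」",   # CJK corner brackets
    r'"[^"]*"',     # straight double quotes
    r"“[^”]*”",     # curly double quotes (assumption, not named on this card)
]))


def find_quoted_dialogue(prompt):
    """Return every quoted span found in the prompt (empty list if clean)."""
    return _QUOTED.findall(prompt)


print(find_quoted_dialogue("低声警告 「此茶不可饮!」"))  # ['「此茶不可饮!」']
print(find_quoted_dialogue("低声急切警告她茶水有毒"))    # []
```

A render script could refuse to start, or ask for a rewrite into indirect narration, whenever the returned list is non-empty.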

### 3. Avoid "object detaches" prompts during action

At high motion intensity, the model loses object tracking. A directive like "fedora flies off mid-spin and tumbles to the floor" produces broken output: the hat dematerialises. Either:
- Keep the object attached and say so explicitly ("the fedora STAYS ON his head throughout the spin")
- Or render the attached and detached states as two clips and concatenate them

### 4. Cross-shot identity drift

In multi-shot dialogue scenes, character identity drifts across cuts. Workaround: re-pin the reference image at frame 0 of every shot. (A deterministic seed, the same first-frame pin, and the same prompt scaffolding together give good repeatability.)
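
One way to keep those three things fixed is to build every shot from a single plan. The sketch below only prepares plain dictionaries; the function name, the dictionary keys, and the idea of prepending a shared scaffold string are illustrative assumptions, and nothing here calls the LTX pipeline.

```python
def build_shot_jobs(scaffold, shot_texts, ref_path, seed=42, pin_strength=0.9):
    """One job per shot: shared scaffold and seed, same reference pinned at frame 0."""
    jobs = []
    for index, shot_text in enumerate(shot_texts):
        jobs.append({
            "shot": index,
            "prompt": f"{scaffold} {shot_text}",  # same scaffolding for every shot
            "seed": seed,                          # same seed for every shot
            "pin": {"path": ref_path, "frame_idx": 0, "strength": pin_strength},
        })
    return jobs


jobs = build_shot_jobs(
    "Palace garden, soft daylight.",
    ["Shen walks in.", "Close-up on Shen."],
    "character_ref.png",
)
print(jobs[1]["prompt"])  # Palace garden, soft daylight. Close-up on Shen.
```

Each `pin` entry carries the same three fields passed to `ImageConditioningInput` in the single-character example, so a render loop can unpack it directly.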

### Render performance

- **Resolution:** 1280 × 704, 121 frames @ 24 fps (~5 s output)
- **Hardware:** NVIDIA A800 80 GB → ~70 s per shot
- **Output:** mp4 with ambient audio track (no TTS)

On consumer hardware (RTX 4090 24 GB), expect ~3-4 minutes per shot.

---

## Limitations

1. **Subtitle hallucination** with quoted dialogue (see tip #2)
2. **Complex physical interactions** (wrestling, hugging, intricate hand-on-hand contact) can deform
3. **Tail-frame artifact** of LTX-2.3: the last 6-8 frames may smear; trim them in post if needed
4. **Action complexity ceiling**: the 8-step distilled budget caps motion complexity at action peaks
5. **Portrait orientation** degrades identity (the LoRA was trained on landscape only)
6. **Dual-character via two separate refs is awkward** (see "How to use" above); the composite-image pin is the cleanest workaround
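
For limitation 3, the trim can be scripted. The sketch below only builds an `ffmpeg` argument list and does not run it; re-encoding with `libx264` and `aac` is an assumption chosen so the cut does not depend on keyframe positions, the file names are placeholders, and the default of 8 dropped frames is simply the upper end of the range above.

```python
def tail_trim_command(src, dst, num_frames=121, fps=24, drop_frames=8):
    """Build an ffmpeg command that keeps everything except the last frames."""
    keep_frames = num_frames - drop_frames
    if keep_frames <= 0:
        raise ValueError("drop_frames must be smaller than num_frames")
    keep_seconds = keep_frames / fps
    return [
        "ffmpeg", "-y", "-i", src,
        "-t", f"{keep_seconds:.3f}",   # output duration; trims video and audio
        "-c:v", "libx264", "-c:a", "aac",
        dst,
    ]


print(" ".join(tail_trim_command("shot.mp4", "shot_trimmed.mp4")))
# To execute: subprocess.run(tail_trim_command(...), check=True)
```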

---

## Original Chinese README (preserved)

The original Chinese model card from ModelScope is reproduced below for users who want the unmodified original documentation. (Note: the original card uses the "IC-LoRA" label; the term has been kept here for fidelity, even though the A/B/C test described above shows the conditioning mechanism is first-frame i2v pinning rather than parallel-canvas IC-LoRA.)

<details>
<summary>点击展开原版中文模型卡片 (click to expand original Chinese README)</summary>

### LTX-Video (2.3) IC-LoRA: 双人分镜头对话增强模型

本模型是基于 Lightricks LTX-2.3 底模训练的 IC-LoRA,专为双人同框对话、角色互动及分镜头视频生成场景深度优化。

**一、模型核心提升**

1. 角色参考稳定性:显著提升双人同框时的人物特征一致性,减少角色漂移。
2. 分镜构图稳定性:针对影视化对话构图进行了加固,支持更精准的镜头控制。
3. 叙事连贯性:增强了对多段描述的理解力,使分镜间的过渡衔接更自然。
4. 风格兼容性:完美支持古风仙侠、现代都市、3D 动漫等主流视觉风格。

**二、模型基本信息**

1. 基础模型:Lightricks/LTX-2.3
2. 许可证:Apache-2.0
3. 管道标签:image-to-video, text-to-video
4. 模型用途:仅供学习交流使用
5. 开发者:麻雀 AI

**三、运行指南**

1. 推荐平台:ComfyUI
2. 支持工作流:ComfyUI 官方 LTX 工作流、KJ-LTX 插件工作流
3. 生成模式:文生视频 (T2V) 与 图生视频 (I2V) 均支持
4. 硬件参考:RTX 5090 显卡在 720P 分辨率下,单条视频生成耗时约 2 分钟

**四、推荐参数配置**

1. 分辨率:建议使用 16:9 (如 1280x720)
2. 时长与帧率:建议时长 ≥10 秒,帧率设定为 24 FPS
3. LoRA 权重设定:
   - 独立使用建议:0.6 - 1.0
   - 叠加其他 LoRA 使用时建议:0.3 - 0.5

**五、Prompt 编写规范**

1. 编写逻辑:需包含完整的场景描述 + 角色设定 + 分镜设计 + 镜头语言,强化双人对话互动逻辑。
2. 触发词说明:无需特定触发词。

**六、效果说明与局限性**

1. 优势风格:在古风、现代、3D 动漫类双人对话场景中表现最佳。
2. 已知限制:受限于 LTX-2.3 底模性能,极其复杂的双人肢体互动(如缠绕、打斗)可能出现形变。
3. 运动幅度:建议以对话和微动作为主,大动态动作的连贯性仍有提升空间。

</details>

---

## Hardware requirements

| GPU | VRAM | Works? |
|---|---|---|
| A100 / A800 80 GB | 80 GB | ✅ ~70 s per 5 s shot |
| RTX 4090 / 3090 | 24 GB | ✅ ~3-4 min per 5 s shot |
| RTX 4080 / 4070 Ti Super | 16 GB | ❌ won't fit 22B in bf16 |
| Anything under 24 GB | < 24 GB | ❌ no |
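
For context on the table, the back-of-envelope arithmetic for weights alone is sketched below. It counts only the 22B parameters of the backbone and ignores the text encoder, upscaler, VAE, and activations, so real usage is higher. This card does not state which precision or offloading setup produced the 24 GB timings, so the figures are orientation only, and the helper name is hypothetical.

```python
def weight_footprint_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


PARAMS = 22e9  # "22B distilled", as stated on this card

print(f"bf16 : {weight_footprint_gb(PARAMS, 2):.0f} GB")  # bf16 : 44 GB
print(f"8-bit: {weight_footprint_gb(PARAMS, 1):.0f} GB")  # 8-bit: 22 GB
```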

---

## Acknowledgements

- **麻雀 AI (Maque AI)**: original author of this LoRA ([original ModelScope repository](https://www.modelscope.cn/models/fxj1131/LTX2.3-IC-LORA-Dual-Character))
- **[Lightricks](https://www.lightricks.com/)**: for the LTX-Video 2.3 base model
- **ZKong**: for catching the IC-LoRA labeling discrepancy in the discussion thread; the empirical A/B/C test run in response settled it

---

## Source attribution

> This is an English-language mirror of [fxj1131's LTX2.3 Dual-Character LoRA on ModelScope](https://www.modelscope.cn/models/fxj1131/LTX2.3-IC-LORA-Dual-Character).
> All credit for the model weights belongs to the original author, **麻雀 AI (Maque AI)**.
> This mirror exists to make the model and its documentation accessible to HuggingFace users who cannot easily access ModelScope, and to share field-tested usage notes from a production deployment.
> **The `.safetensors` weights file is unmodified and byte-identical to the ModelScope upload.**

---

## License

Apache License 2.0, the same as the original. See `LICENSE` and `NOTICE`.