---
license: other
license_name: ltx-2-community-license-agreement
tags:
  - ltx-2
  - ic-lora
  - head-swap
  - video-to-video
  - image-to-video
  - bfs
  - lora
base_model:
  - Lightricks/LTX-2
library_name: diffusers
pipeline_tag: image-to-video
---

## ⚠️ Ethical Use & Disclaimer

This model is a technical tool designed for **Digital Identity Research, Professional VFX Workflows, and Cinematic Prototyping**.

By downloading or using this LoRA, you acknowledge and agree to the following:

* **Intended Use:** Designed for filmmakers, VFX artists, and researchers exploring high-fidelity video identity transformation.
* **Consent & Rights:** You must possess explicit legal consent and all necessary rights from any individual whose likeness is being processed.
* **Legal Compliance:** You are fully responsible for complying with all local and international laws regarding synthetic media.
* **Liability Waiver:** This model is provided *"as is."* **As the creator (Alissonerdx), I assume no responsibility for misuse.** Any legal, ethical, or social consequences are solely the responsibility of the end user.

---

# 📺 Video Examples

## V1 Examples

Generated using the **Frame 0 Anchoring Technique**. All examples follow the guide video's motion while preserving the identity provided in the first frame.

| Example 1 | Example 2 |
| --- | --- |
| | |

| Example 3 | Example 4 |
| --- | --- |
| | |

| Example 5 |
| --- |
| |

## V3 Examples

If you want to see the full setup in practice, watch here: [https://www.youtube.com/watch?v=HBp03iu7wLA](https://www.youtube.com/watch?v=HBp03iu7wLA)

The following examples demonstrate the new **persistent-template workflow** used in V3:

| Example 6 | Example 7 |
| --- | --- |
| | |

| Example 8 |
| --- |
| |

The image references for the versions are stored under:

```txt
ltx-2.3/...
```

---

# 🛠 Technical Background (V1)

To achieve this level of identity transfer, I **heavily modified the official LTX-2 training scripts**.

### Key Improvements

* **Novel Conditioning Injection:** Custom latent injection methods for reference identity stabilization.
* **Noise Distribution Overhaul:** Implemented a **custom High-Noise Power Law timestep distribution**, forcing the model to prioritize target identity reconstruction over guide-video context (sketched below).
* **Training Compute:** 60+ hours of training on **NVIDIA RTX PRO 6000 Blackwell GPUs**, iterating through 300GB+ of experimental checkpoints.
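The modified training code is not published here, but the general idea of a high-noise power-law timestep distribution can be sketched in a few lines. This is a minimal illustration only; the function name and the `alpha` exponent below are assumptions, not the values used in training:

```python
import torch

def sample_high_noise_timesteps(batch_size: int, alpha: float = 3.0,
                                num_train_timesteps: int = 1000) -> torch.Tensor:
    """Sample timesteps biased toward the high-noise end of the schedule.

    Mapping u ~ U(0, 1) through u ** (1 / (alpha + 1)) yields a density
    proportional to t ** alpha on [0, 1], so a larger alpha concentrates
    samples near t = 1 (maximum noise), where identity reconstruction
    dominates over guide-video context.
    """
    u = torch.rand(batch_size)
    t = u ** (1.0 / (alpha + 1.0))  # power-law skew toward the noisy end
    return (t * (num_train_timesteps - 1)).long()

# Illustrative usage: most sampled indices land near num_train_timesteps - 1.
print(sample_high_noise_timesteps(8))
```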
---

# 📊 Dataset Specifications

## V1 Dataset

* **300 high-quality head swap video pairs**
* Trained on **512x512 buckets**
* Primarily **landscape format**
* Optimized for **close-up framing**

Wide shots may reduce identity fidelity.

---

# 💡 Inference Guide (V1)

## 🔴 CRITICAL — Frame 0 Requirement

This version was trained to use **Frame 0 as the identity anchor**. You must prepare the first frame correctly.

### Recommended Workflow

1. Perform a high-quality head swap on Frame 0.
2. Use that processed frame as the conditioning input.
3. Run the full video generation.

For best results, prepare Frame 0 using my previous **BFS Image Models**.

---

## Optimization

### LoRA Strength

* **1.0** → Best motion fidelity
* **>1.0** → Stronger identity and hair capture, but may distort the original motion

### Multi-Pass Workflows

You can experiment with multiple passes using different strengths.

### Prompting

Detailed prompts currently have **no effect**. Trigger remains:

```txt
head swap
```

---

# ⚠️ Known Issues (V1 – Alpha)

* **Identity Leakage:** Hair from the guide video may reappear.
* **Hard Cuts:** Jump cuts can reset identity.
* **Portrait Format:** Performance is significantly better in landscape.

---

# 🚀 Version 2 – Major Update

V2 introduces a **complete redesign of the conditioning strategy and masking logic**, significantly improving identity robustness and reducing leakage.

---

## 🔹 Multiple Conditioning Modes (Using First Frame)

V2 supports multiple identity injection approaches:

### 1️⃣ Direct Photo Conditioning

Use a clean photo of the new face as the reference input.

This method works and can produce strong results. However, because the model must internally reconcile differences in lighting, perspective, depth, and occlusion, it may struggle to integrate the new identity into the guide video correctly. In some cases, this can reduce stability or identity consistency.

### 2️⃣ First-Frame Head Swap (Recommended)

Applying a proper head swap on Frame 0 still produces **extremely strong and reliable results**.

Because the first frame is already structurally correct — pose, lighting, depth, and occlusions — the model has significantly less work to do. Instead of forcing alignment from a static photo, it simply propagates and stabilizes the identity through time.

This approach generally:

* Produces higher identity fidelity
* Reduces deformation
* Minimizes integration artifacts
* Improves overall temporal stability

### 3️⃣ Automatic Magazine-Style Overlay

The new face is automatically cut out and positioned over the guide face using mask alignment. This simulates a magazine-cutout-style overlay, but is performed automatically based on mask positioning.

### 4️⃣ Manual Overlay

Advanced users may manually composite the new face over Frame 0 before running inference.

---

## 🔹 Facial Motion Behavior (Important Change)

Unlike V1: **V2 does not follow the original guide face's facial micro-movements.**

The guide face is fully masked to prevent identity leakage. This makes masking quality critical.

### Mask Requirements

* The guide face must be completely covered.
* The mask color must be a **magenta tone**.
* Any visible guide identity may leak into the final output.
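A minimal sketch of that masking step, assuming you already have a binary face mask from any segmentation tool. The file names and the exact magenta value are placeholders; use whatever your workflow expects:

```python
import numpy as np
from PIL import Image

# Placeholder file names; the binary mask must cover the entire guide face,
# hairline included, or the remaining pixels can leak identity into the output.
frame = np.array(Image.open("guide_frame.png").convert("RGB"))
face_mask = np.array(Image.open("face_mask.png").convert("L")) > 127

MAGENTA = np.array([255, 0, 255], dtype=np.uint8)  # assumed magenta tone
frame[face_mask] = MAGENTA  # paint the masked face region magenta

Image.fromarray(frame).save("guide_frame_masked.png")
```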
---

## 🔹 Mask Types

Users may alternate between:

### ▪ Square Masks

* More stable identity
* Better consistency
* Often produce stronger overall results
* May generate slightly oversized heads due to spatial padding

In most scenarios, square masks tend to perform better because they provide additional spatial context for the model to reconstruct structure and hair.

### ▪ Tight / Adjusted Masks

* More natural head proportions
* May deform if the guide head shape differs significantly
* Sensitive to long-hair mismatches

If the original guide has long hair and the new identity does not, the risk of deformation increases.

---

## 🔹 Dataset & Training Improvements (V2)

* **800+ video pairs**
* Trained at **768 resolution**
* **768** is the recommended inference resolution
* Improved hair stability
* Reduced identity leakage compared to V1
* More robust identity transfer under motion

---

## 🔹 First Pass vs Second Pass

You may:

* Run a single pass at 768 (recommended)
* Or run a downscaled first pass plus a second upscale pass

⚠️ Important: A second pass may alter the identity established in the first pass and reduce consistency in some cases.

---

## 🔹 Trigger

Trigger remains:

```txt
head swap
```

---

# 🚀 Version 3 – Persistent Template Workflow

V3 introduces a new **persistent-template conditioning workflow**.

Unlike previous versions, which relied primarily on the identity being established from **Frame 0 only**, V3 uses a **custom guide-video construction step** that keeps the new face visible throughout the entire guide sequence.

This results in a much stronger and more persistent identity signal during inference.

# 🙏 Acknowledgements

Special thanks to **facy.ai** for sponsoring the GPU used to train this model. If you want to check out their platform, you can use my referral link: [https://facy.ai/a/headswap](https://facy.ai/a/headswap)

---

## 🔹 How V3 Works

V3 uses a custom node from **ComfyUI-BFSNodes** to prepare the guide video before inference.

Repository:

```txt
https://github.com/alisson-anjos/ComfyUI-BFSNodes
```

Workflow file:

```txt
workflows/workflow_ltx2_head_swap_drag_and_drop_v3.0
```

The guide-video preparation process works like this:

1. Start from the original guide video
2. Add a **vertical green chroma-key strip** on the side
3. Place the **reference face image** inside that strip
4. Apply this composition to **every frame** of the original video
5. Use this new composite video as the actual inference guide

This means the new identity remains **fully visible during all frames** of the guide video, instead of appearing only in Frame 0 as in previous versions. That is the main reason V3 can achieve better consistency than earlier versions. (A minimal sketch of this compositing step is shown after the next section.)

---

## 🔹 Why V3 Is Different

Because the identity reference stays visible during the full guide sequence, V3 gives the model a much more stable conditioning signal across time.

In practice, this can improve:

* Identity consistency
* Temporal stability
* Resistance to identity drift
* Facial motion continuity
* Lip sync behavior
* Expressive facial movement preservation

This version is especially useful for shots where the face remains visible for longer periods, or where dialogue, mouth movement, and facial acting matter more.

V3 is not just a refinement of the first-frame method. It changes the conditioning logic by giving the model access to a persistent identity template across the entire inference sequence.
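Here is the minimal compositing sketch referenced in the How V3 Works section above. It only illustrates the composition the ComfyUI-BFSNodes node performs for you; the strip width, chroma color, file names, and frame count are illustrative assumptions:

```python
from PIL import Image

STRIP_WIDTH = 256           # assumed width of the chroma strip
CHROMA_GREEN = (0, 255, 0)  # assumed chroma-key color

reference = Image.open("reference_face.png").convert("RGB")

def add_identity_strip(frame: Image.Image) -> Image.Image:
    """Append a green strip containing the reference face to one guide frame."""
    w, h = frame.size
    canvas = Image.new("RGB", (w + STRIP_WIDTH, h), CHROMA_GREEN)
    canvas.paste(frame, (0, 0))                    # original guide frame on the left
    ref = reference.copy()
    ref.thumbnail((STRIP_WIDTH, h))                # fit the face inside the strip
    canvas.paste(ref, (w, (h - ref.height) // 2))  # vertically centered in the strip
    return canvas

# Apply the same composition to every frame, then use the resulting
# composite clip as the inference guide (placeholder paths and frame count).
composite_frames = [
    add_identity_strip(Image.open(f"guide/frame_{i:04d}.png").convert("RGB"))
    for i in range(120)
]
```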
---

## 🔹 Final Output Behavior

Even though the guide video used during inference contains the **vertical chroma-key side strip**, the **final generated result does not include that strip**.

The generated video is returned in the **original resolution and framing** of the source guide video.

So in practice:

* The green side strip exists only in the internal guide/template video
* It is used only to improve inference conditioning
* It does not appear in the final output

---

## 🔹 Prompting for V3

For V3, users can also pass the composite guide video into a vision-capable model to extract a structured prompt.

This is useful because the composite video contains two different information sources:

* the **reference identity** inside the side strip
* the **performance and scene information** in the main video area

This helps keep identity and action description separated more cleanly.

### Recommended Prompt Template

```txt
Analyze this composite video. The video contains:
1. a side chroma-key panel with a reference face image
2. a main performance video showing the body, clothing, movement, hand actions, objects, framing, and environment

Your task is to extract:
- the target face identity from the side panel
- the performance/action from the main video

Critical rules:
- The side-panel face is the only valid source for identity traits and head-level accessories.
- Ignore the visible face and head appearance in the main video completely.
- Do not describe any face, hair, hairstyle, hair color, eye color, makeup, facial features, facial expression, attractiveness, headwear, hood, hat, or accessories from the main video.
- In the ACTION section, describe the performer only as "a person" and focus only on body movement, clothing, hand actions, objects, framing, and environment.
- Do not mention the chroma panel, green background, split layout, or editing structure.
- Be factual and non-creative.
- Do not guess uncertain details. If a detail is not clearly visible, omit it.

Return exactly in this format:

head_swap:
FACE: A brief but detailed objective identity description from the side-panel face only. Include, when clearly visible: apparent gender, apparent ethnicity, skin tone or complexion, approximate age range, head shape, hair or baldness pattern, hair color, eye color, facial hair, visible skin details, headwear or head covering, visible facial accessories, and any especially distinctive facial trait. Prioritize the eyes when they are a strong defining feature.
ACTION: A concise performance description from the main video. Include only: visible clothing, body position, movement, hand actions, objects being shown or handled, camera-facing behavior, framing, and environment. Do not include any face or head appearance from the main video.

Good example:
FACE: Female, fair skin, approximately 20-30 years old, oval head shape, long wavy vivid blue-violet hair, bright golden-amber eyes with dark defined pupils, no facial hair, smooth skin, and pink flower hair accessories as a distinctive head adornment.
ACTION: A person in a dark top faces the camera indoors, holds a package of false eyelashes close to the lens, peels one lash from the backing, brings it near the eye area, and examines it while making small hand movements.

Bad example:
ACTION: A person with long curly blonde braids holds a pair of false eyelashes...
```

### How to Use

1. Generate the V3 composite guide video using the node
2. Pass that composite video into a vision-capable model
3. Extract the structured **FACE** and **ACTION** prompt
4. Use that output as the base prompt for the V3 workflow

---

## 🔹 Captions / Descriptions for V3

If you want automatic captions or prompt extraction from video, you can also use my Ollama nodes.

Repository:

```txt
https://github.com/alisson-anjos/ComfyUI-Ollama-Describer
```

A useful node for this workflow is **Ollama Video Describer**.

This can help generate structured descriptions from the composite guide video and make it easier to build the final prompt for V3.

---

## 🔹 V3 Trigger

Trigger remains:

```txt
head_swap:
FACE: ....
ACTION: ....
```

---

# 🔴 Critical Success Factor (V2 / V3)

Mask and preparation quality still matter enormously.

Even with improved conditioning, final quality depends on:

* Proper face coverage
* Clean compositing
* Strong alignment
* Good source and reference quality

If any portion of the original guide identity remains visible where it should not, the model may still reintroduce unwanted traits.

Take time to refine your inputs. Better preparation consistently produces better output than simply increasing LoRA strength.

---

## 🔧 Advanced Technique: Combine with LTX-2 Inpainting

Advanced users can experiment with combining this LoRA with the native **LTX-2 inpainting workflow**.

This can help:

* Refine problematic areas
* Correct small deformation zones
* Improve edge blending
* Recover detail in hair or jaw regions

When properly combined, inpainting can significantly enhance final output quality, especially in challenging frames.

---

## 🔹 Recommendation

I strongly recommend testing **both LoRAs** and comparing the final behavior.

Depending on the guide clip, framing, facial motion, and the kind of result you want, some users may prefer the look or motion style of one version over the other.

In general:

* **V2** may still be preferred for some first-frame-driven workflows
* **V3** is better when you want a stronger persistent identity signal, better consistency, and better facial/lip motion continuity

The best version will often depend on the shot and on personal preference.

---

# 💙 Support

Maintaining R&D and renting Blackwell GPUs is expensive.

If this project helps you, consider supporting the development of:

* V3 improvements
* Advanced conditioning pipelines
* SAM 3 integration
* Full reference-photo-only workflows

Support here: [https://buymeacoffee.com/nrdx](https://buymeacoffee.com/nrdx)