---
license: other
license_name: ltx-2-community-license-agreement
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
language:
- en
- zh
library_name: diffusers
tags:
- video-generation
- video-reasoning
- logical-reasoning
- lora
- ltx-2.3
base_model:
- Lightricks/LTX-2.3
---

# LTX-2 VBVR LoRA - Video Reasoning

LoRA fine-tuned weights for LTX-2.3 22B on the VBVR (Video Benchmark for Video Reasoning) dataset.

## Training Data

**To ensure training quality, we preprocessed the full 1,000,000 videos from the official dataset and randomly sampled from them during training to maintain data diversity. We adopted the official parameters, batch_size=16 and rank=32, to prevent the catastrophic forgetting that an excessively large rank can cause.**

The VBVR dataset contains 200 reasoning task categories, with ~5,000 variants per task, totaling ~1M videos. The main task types include:

- **Object Trajectory**: Objects moving to target positions
- **Physical Reasoning**: Rolling balls, collisions, gravity
- **Causal Relationships**: Conditional triggers, chain reactions
- **Spatial Relationships**: Relative positions, path planning

## Model Details

| Item | Details |
|------|---------|
| Base Model | ltx-2.3-22b-dev |
| Training Method | LoRA Fine-tuning |
| LoRA Rank | 32 |
| Effective Batch Size | 16 |
| Mixed Precision | BF16 |

## TODO List

### Dataset Release Plan

| Dataset | Videos | Status |
|---------|--------|--------|
| VBVR-96K | 96,000 | ✅ Released |
| VBVR-240K | 240,000 | 🔄 Processing |
| VBVR-480K | 480,000 | 📋 Planned |

## LoRA Capabilities

This LoRA adapter enhances the base LTX-2 model for production video generation workflows:

- **Enhanced Complex Prompt Understanding**: Accurately interprets multi-object, multi-condition prompts with detailed spatial descriptions and temporal sequences, reducing prompt misinterpretation in production scenarios.
- **Improved Motion Dynamics**: Generates smooth, physically plausible object movements with natural acceleration, deceleration, and trajectory curves, avoiding robotic or unnatural motion patterns.
- **Temporal Consistency**: Maintains object appearance, lighting, and scene coherence throughout the video sequence, reducing flickering and frame-to-frame artifacts common in generated videos.
- **Precise Timing Control**: Enables accurate control over action duration, pacing, and synchronization between multiple moving elements based on prompt semantics.
- **Multi-Object Interaction**: Handles complex scenes with multiple objects interacting simultaneously, including collisions, following, avoiding, and coordinated movements.
- **Camera and Framing Stability**: Maintains consistent camera perspective and framing throughout the sequence, avoiding unwanted camera shake or unexpected viewpoint changes.

## Training Configuration

| Config | Value |
|--------|-------|
| Learning Rate | 1e-4 |
| Scheduler | Cosine |
| Gradient Accumulation | 16 steps |
| Gradient Clipping | 1.0 |
| Optimizer | AdamW |

## Evaluation Metrics

![Loss Training Curve](loss-plot-96000.png)

| Metric | Value |
|--------|-------|
| Training Steps | ~6,000 |
| Final Loss | ~0.008 |
| Loss Reduction | 44% (from 0.014 to 0.008) |

## Video Demo

### Training Progress Comparison

#### Step 0 (Base Model)

Initial model output.

#### Step 6000 (Fine-tuned)

After 6K steps of training.

## Dataset

This model is trained on the VBVR (Video Benchmark for Video Reasoning) dataset from [video-reason.com](https://video-reason.com/).

## Contact

For questions or suggestions, please open an issue on Hugging Face or contact the author directly.
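
## Appendix: What Rank 32 Means in Practice

LoRA replaces each full weight update ΔW with a low-rank product B·A, so an adapted d_out × d_in matrix adds only r·(d_out + d_in) trainable parameters instead of d_out·d_in. The sketch below is a pure-Python illustration of that parameter arithmetic for the rank-32 setting above; the 8192 hidden size is an illustrative assumption, not the actual LTX-2.3 architecture.

```python
# LoRA update: W' = W + (alpha / r) * (B @ A)
# B has shape (d_out, r), A has shape (r, d_in).

def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters added by one LoRA-adapted matrix."""
    return d_out * rank + rank * d_in

def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters touched by full fine-tuning of the same matrix."""
    return d_out * d_in

# Illustrative hidden size -- NOT the real LTX-2.3 dimensions.
d = 8192
rank = 32  # the rank used for this adapter

lora = lora_param_count(d, d, rank)   # 524,288
full = full_param_count(d, d)         # 67,108,864
print(f"LoRA params: {lora:,}")
print(f"Full params: {full:,}")
print(f"Fraction trained: {lora / full:.3%}")
```

Keeping this fraction small (here 1/128 of the matrix) is also why a moderate rank is preferred over a very large one: fewer trainable parameters limit how far the adapter can drift from the base model's behavior.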
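
## Appendix: Effective Batch Size via Gradient Accumulation

The training configuration pairs 16 gradient-accumulation steps with an effective batch size of 16: gradients from small micro-batches are averaged before a single optimizer step, which is numerically equivalent to one step on the full batch. This is a toy pure-Python sketch of the general technique (a scalar least-squares model), not the actual training code.

```python
# Gradient accumulation: averaging the gradients of 16 micro-batches
# of size 1 reproduces the gradient of one batch of 16.

def grad(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the batch."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [0.1 * i for i in range(16)]       # toy data, 16 samples
ys = [0.3 * x + 0.05 for x in xs]

# One optimizer step's gradient over the full batch of 16:
g_full = grad(w, xs, ys)

# 16 accumulation steps of micro-batch 1, averaged before stepping:
g_accum = sum(grad(w, [x], [y]) for x, y in zip(xs, ys)) / 16

assert abs(g_full - g_accum) < 1e-12
```

In practice this lets a large effective batch fit on limited GPU memory: only the micro-batch activations are resident at any time, while the accumulated gradient matches the large-batch update.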