---
license: other
license_name: ltx-2-community-license-agreement
license_link: https://github.com/Lightricks/LTX-2/blob/main/LICENSE
language:
- en
- zh
library_name: diffusers
tags:
- video-generation
- video-reasoning
- logical-reasoning
- lora
- ltx-2.3
base_model:
- Lightricks/LTX-2.3
---

# LTX-2.3 VBVR LoRA - Video Reasoning

LoRA fine-tuned weights for LTX-2.3 22B on the VBVR (A Very Big Video Reasoning Suite) dataset.

## Training Data

**To ensure training quality, we preprocessed all 1,000,000 videos from the official dataset and sample from them at random during training to maintain data diversity. We adopt the official parameters, batch_size=16 and rank=32; the moderate rank helps avoid the catastrophic forgetting that an excessively large rank can cause.**

The VBVR dataset contains 200 reasoning task categories, with ~5,000 variants per task, totaling ~1M videos. Main task types include:

- **Object Trajectory**: Objects moving to target positions
- **Physical Reasoning**: Rolling balls, collisions, gravity
- **Causal Relationships**: Conditional triggers, chain reactions
- **Spatial Relationships**: Relative positions, path planning
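
The dataset size and the random sampling described above can be sketched in a few lines. This is only an illustration, not the actual training loader: the `(task_id, variant_id)` indexing scheme, the function names, and the seed handling are assumptions, and the per-task variant count is the approximate figure quoted above.

```python
import random

NUM_TASKS = 200            # reasoning task categories (from the description above)
VARIANTS_PER_TASK = 5_000  # approximate number of variants per task


def total_videos(num_tasks=NUM_TASKS, variants=VARIANTS_PER_TASK):
    """Total number of videos implied by the task and variant counts."""
    return num_tasks * variants


def sample_batch(batch_size, seed=None,
                 num_tasks=NUM_TASKS, variants=VARIANTS_PER_TASK):
    """Draw distinct (task_id, variant_id) pairs uniformly at random.

    Hypothetical indexing: videos are numbered 0..N-1 and mapped back to a
    (task, variant) pair with divmod.
    """
    rng = random.Random(seed)
    flat_ids = rng.sample(range(num_tasks * variants), batch_size)
    return [divmod(i, variants) for i in flat_ids]


if __name__ == "__main__":
    print(total_videos())            # 200 * 5,000 videos
    print(sample_batch(4, seed=0))   # four random (task_id, variant_id) pairs
```

Sampling uniformly over the flattened index keeps every task category equally likely to appear in a batch, which is one simple way to maintain diversity.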

## Model Details

| Item | Details |
|------|---------|
| Base Model | ltx-2.3-22b-dev |
| Training Method | LoRA Fine-tuning |
| LoRA Rank | 32 |
| Effective Batch Size | 16 |
| Mixed Precision | BF16 |

## TODO List

### Dataset Release Plan

| Dataset | Videos | Status |
|---------|--------|--------|
| VBVR-96K | 96,000 | ✅ Released |
| VBVR-240K | 240,000 | 🔄 Processing |
| VBVR-480K | 480,000 | 📋 Planned |

## LoRA Capabilities

This LoRA adapter is intended to improve the base LTX-2.3 model for production video generation workflows in the following areas:

- **Enhanced Complex Prompt Understanding**: Accurately interprets multi-object, multi-condition prompts with detailed spatial descriptions and temporal sequences, reducing prompt misinterpretation in production scenarios.

- **Improved Motion Dynamics**: Generates smooth, physically plausible object movements with natural acceleration, deceleration, and trajectory curves, avoiding robotic or unnatural motion patterns.

- **Temporal Consistency**: Maintains object appearance, lighting, and scene coherence throughout the video sequence, reducing flickering and frame-to-frame artifacts common in generated videos.

- **Precise Timing Control**: Enables accurate control over action duration, pacing, and synchronization between multiple moving elements based on prompt semantics.

- **Multi-Object Interaction**: Handles complex scenes with multiple objects interacting simultaneously, including collisions, following, avoiding, and coordinated movements.

- **Camera and Framing Stability**: Maintains consistent camera perspective and framing throughout the sequence, avoiding unwanted camera shake or unexpected viewpoint changes.


## Training Configuration

| Config | Value |
|--------|-------|
| Learning Rate | 1e-4 |
| Scheduler | Cosine |
| Gradient Accumulation | 16 steps |
| Gradient Clipping | 1.0 |
| Optimizer | AdamW |

## Evaluation Metrics

![Loss Training Curve](loss-plot-96000.png)

| Metric | Value |
|--------|-------|
| Training Steps | ~6,000 |
| Final Loss | ~0.008 |
| Loss Reduction | ~43% (from ~0.014 to ~0.008) |
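
The relative loss reduction can be recomputed from the rounded start and end losses in the table. This is a small arithmetic check; because both losses are rounded, the result may differ slightly from a percentage computed on the unrounded values.

```python
def relative_reduction(start, end):
    """Fractional decrease from start to end, relative to start."""
    if start <= 0:
        raise ValueError("start must be positive")
    return (start - end) / start


if __name__ == "__main__":
    pct = 100 * relative_reduction(0.014, 0.008)
    print(f"{pct:.1f}%")  # prints 42.9%
```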

## Video Demo

### Training Progress Comparison

#### Step 0 (Base Model)

<video src="https://huggingface.co/LiconStudio/Ltx2.3-VBVR-lora-I2V/resolve/main/step_000000_1.mp4" controls></video>

Initial model output.

#### Step 6000 (Fine-tuned)

<video src="https://huggingface.co/LiconStudio/Ltx2.3-VBVR-lora-I2V/resolve/main/step_006000_1.mp4" controls></video>

After 6K steps of training.
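
The two demo videos follow a common file-naming pattern, so their URLs can be built from the step number. The helper below is a hypothetical convenience, not part of any library; it assumes the trailing `_1` is a sample index and that other checkpoints, if any are uploaded, use the same pattern.

```python
REPO_ID = "LiconStudio/Ltx2.3-VBVR-lora-I2V"  # repository from the demo links


def demo_video_url(step, sample_index=1, repo_id=REPO_ID, revision="main"):
    """Build the download URL of a demo video for a given training step.

    Assumes files are named step_<6-digit step>_<sample index>.mp4.
    """
    filename = f"step_{step:06d}_{sample_index}.mp4"
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


if __name__ == "__main__":
    print(demo_video_url(0))      # base model output
    print(demo_video_url(6000))   # output after fine-tuning
```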

## Dataset

This model is trained on the VBVR (A Very Big Video Reasoning Suite) dataset from [video-reason.com](https://video-reason.com/).


## Contact

For questions or suggestions, please open a discussion on the Hugging Face model page or contact the author directly.