
Title: Open-Sora Plan: Open-Source Large Video Generation Model

URL Source: https://arxiv.org/html/2412.00131

Markdown Content:

License: CC BY 4.0

arXiv:2412.00131v1 [cs.CV] 28 Nov 2024

Open-Sora Plan Team

https://github.com/PKU-YuanGroup/Open-Sora-Plan

See Contributions section for full author list.

Abstract

We introduce Open-Sora Plan, an open-source project that aims to contribute a large generation model for producing high-resolution, long-duration videos from various user inputs. The project comprises multiple components covering the entire video generation process, including a Wavelet-Flow Variational Autoencoder, a Joint Image-Video Skiparse Denoiser, and various condition controllers. We also design many assistant strategies for efficient training and inference, and propose a multi-dimensional data curation pipeline for obtaining high-quality data. Benefiting from these efficiency-oriented designs, Open-Sora Plan achieves impressive video generation results in both qualitative and quantitative evaluations. We hope our careful design and practical experience can inspire the video generation research community. All our code and model weights are publicly available at https://github.com/PKU-YuanGroup/Open-Sora-Plan.

1 Introduction

Driven by recent progress in diffusion models [ho2020denoising, song2020denoising] and transformer architectures [vaswani2017attention, peebles2023scalable], visual content generation demonstrates impressive creative capacity conditioned on given prompts, attracting broad interest and many emerging attempts. Since image generation methods [stable_diffusion, li2024hunyuan] achieve outstanding performance and are applied extensively, video generation models are expected to make significant advances and empower a variety of creative industries, including entertainment, advertising, and film. Many early attempts [guo2023animatediff, dynamicrafter] successfully generate videos with low resolution and few frames, but few efforts tackle high-quality, long-duration video generation because of the enormous computation and data cost.

However, the technical report of Sora [videoworldsimulators2024], the video generation model created by OpenAI, was released suddenly with impressive showcases, shocking the entire video generation community while pointing out a promising way to create remarkable videos. As one of the first open-source projects aiming to re-implement a powerful Sora-like video generation model, Open-Sora Plan has attracted wide attention and contributed many first attempts to the video generation community, inspiring many subsequent works.

In this work, we summarize our practical experiences in recent months and present the technical details of our Open-Sora Plan, which generates high-quality and long-duration videos queried by various categories of conditions including text prompts, multiple images, and structure control signals (canny, depth, sketch, etc.). As illustrated in Fig. 1, we divide the video generation model into three key components and propose improvements for each part:

Figure 1: The model architecture of the Open-Sora Plan consists of a VAE, a Diffusion Transformer, and conditional encoders. The conditional injection encoders enable precise manipulation of individual frames (whether it is the first frame, a subset of frames, or all frames) using designated structural signals, such as images, canny edges, depth maps, and sketches.

• Wavelet-Flow Variational Autoencoder. To reduce memory usage and enhance training speed, we propose WF-VAE, a model that obtains multi-scale features in the frequency domain through a multi-level wavelet transform. These features are then injected into a convolutional backbone using a pyramid structure. We also introduce the Causal Cache method to address the latent space disruption caused by tiling inference.

• Joint Image-Video Skiparse Denoiser. We first change the 2+1D Sora-like video generation denoiser to a 3D full attention structure, significantly enhancing the model's ability to understand the world, including object motion, camera movement, physics, and human actions. Our denoiser is capable of creating both high-quality images and videos with specific designs. We also introduce a cheap but effective operation called Skiparse Attention to further reduce computation.

• Condition Controllers. We design a frame-level image condition controller that introduces image conditions into the base model, supporting various tasks including Image-to-Video, Video Transition, and Video Continuation in one framework. Additionally, we develop a novel network architecture that introduces structure conditions into our base model for controllable generation.

In addition, we carefully design a series of assistant strategies across all stages to train more efficiently and to obtain better results at inference:

• Min-Max Token Strategy. The Open-Sora Plan uses min-max tokens for training, which aggregates data of different resolutions and durations within the same bucket. This strategy unlocks efficient NPU/GPU computation and maximizes the effective usage of data.

• Adaptive Gradient Clipping Strategy. We propose an adaptive gradient clipping strategy that detects outlier data based on the gradient norm at each step, preventing outliers from skewing the model's gradient direction.

• Prompt Refinement Strategy. We develop a prompt refiner that enables the model to reasonably expand input prompts while preserving their semantics. The prompt refiner alleviates inconsistencies in prompt length and descriptive granularity between training and generation, significantly enhancing the stability of video motion and enriching details.

Moreover, we propose an efficient data curation pipeline to automatically filter and annotate visual data from uncleaned datasets:

• Multi-dimensional Data Processor. Our data curation pipeline includes detecting jump cuts, clipping videos, filtering out fast or slow motion, cropping edge subtitles, filtering aesthetic scores, assessing video technical quality, and annotating captions.

• LPIPS-Based Jump Cuts Detection. We implement a video cut detection method based on Learned Perceptual Image Patch Similarity (LPIPS) [Zhang_Isola_Efros_Shechtman_Wang_2018] to prevent incorrect segmentation of fast-motion shots.

We note that Open-Sora Plan is an ongoing open-source project, and we will make continuous efforts towards high-quality video generation. All the latest news, code, and model weights will be publicly updated at https://github.com/PKU-YuanGroup/Open-Sora-Plan.

2 Core Models of Open-Sora Plan

Figure 2: Overview of WF-VAE. WF-VAE [li2024wfvaeenhancingvideovae] consists of a backbone and a main energy path, which injects the main flow of video energy into the backbone through concatenations.

2.1 Wavelet-Flow VAE

Preliminary. The multi-level Haar wavelet transform decomposes video signals by applying the scaling filter $\mathbf{h} = \frac{1}{\sqrt{2}}[1, 1]$ and the wavelet filter $\mathbf{g} = \frac{1}{\sqrt{2}}[1, -1]$ along the temporal and spatial dimensions. For a video signal $\mathbf{V} \in \mathbb{R}^{C \times T \times H \times W}$, where $C$, $T$, $H$, and $W$ correspond to the number of channels, temporal frames, height, and width, the 3D Haar wavelet transform at layer $l$ is defined as:

$$\mathbf{S}^{(l)}_{ijk} = \mathbf{S}^{(l-1)} * (f_i \otimes f_j \otimes f_k), \tag{1}$$

where $f_i, f_j, f_k \in \{\mathbf{h}, \mathbf{g}\}$ represent the filters applied along each dimension, and $*$ represents the convolution operation. The transform begins with $\mathbf{S}^{(0)} = \mathbf{V}$, and for subsequent layers, $\mathbf{S}^{(l)} = \mathbf{S}^{(l-1)}_{hhh}$, indicating that each layer operates on the low-frequency component from the previous layer. At each decomposition layer $l$, the transform produces eight sub-band components:

$$\mathcal{W}^{(l)} = \{\mathbf{S}^{(l)}_{hhh}, \mathbf{S}^{(l)}_{hhg}, \mathbf{S}^{(l)}_{hgh}, \mathbf{S}^{(l)}_{ghh}, \mathbf{S}^{(l)}_{hgg}, \mathbf{S}^{(l)}_{ggh}, \mathbf{S}^{(l)}_{ghg}, \mathbf{S}^{(l)}_{ggg}\}.$$

Here, $\mathbf{S}^{(l)}_{hhh}$ represents the low-frequency component across all dimensions, while $\mathbf{S}^{(l)}_{ggg}$ captures high-frequency details. To implement different downsampling rates in the temporal and spatial dimensions, a combination of 2D and 3D wavelet transforms can be used. Specifically, to obtain a compression rate of $4 \times 8 \times 8$ (temporal × height × width), we can employ a two-layer 3D wavelet transform followed by a one-layer 2D wavelet transform.
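To make the decomposition concrete, the following minimal sketch applies the Haar filters along a single axis and repeats the step on the low-frequency output, as the multi-level transform does. It is an illustration only, not the WF-VAE implementation: it is 1D rather than 3D, it assumes the orthonormal $\frac{1}{\sqrt{2}}$ normalisation, and the names `haar_step` and `haar_multilevel` are hypothetical. The last lines reproduce the arithmetic behind the $4 \times 8 \times 8$ compression rate, where every wavelet level halves the axes it acts on.

```python
import math

# Haar filters (assumed orthonormal normalisation).
H = (1 / math.sqrt(2), 1 / math.sqrt(2))   # scaling (low-pass) filter h
G = (1 / math.sqrt(2), -1 / math.sqrt(2))  # wavelet (high-pass) filter g


def haar_step(signal):
    """One Haar level along one axis: filter non-overlapping pairs (stride 2)."""
    assert len(signal) % 2 == 0, "length must be even"
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    low = [H[0] * a + H[1] * b for a, b in pairs]
    high = [G[0] * a + G[1] * b for a, b in pairs]
    return low, high


def haar_multilevel(signal, levels):
    """Each level re-decomposes the low-frequency output of the previous level."""
    low = list(signal)
    highs = []
    for _ in range(levels):
        low, high = haar_step(low)
        highs.append(high)
    return low, highs


signal = [4.0, 2.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0]
low, highs = haar_multilevel(signal, levels=2)

# With orthonormal filters the total energy is preserved across sub-bands.
energy_in = sum(v * v for v in signal)
energy_out = sum(v * v for v in low) + sum(v * v for h in highs for v in h)

# Two 3D levels act on time and space; one extra 2D level acts on space only.
levels_3d, levels_2d = 2, 1
temporal_rate = 2 ** levels_3d
spatial_rate = 2 ** (levels_3d + levels_2d)
```

In the 3D case the same pairwise filtering is applied along the temporal, height, and width axes in turn, which yields the eight sub-bands listed above.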

Training Objective. Building upon the training strategies outlined in [rombach2022high], the proposed loss function integrates several components: reconstruction loss (including both L1 and perceptual losses [Zhang_Isola_Efros_Shechtman_Wang_2018]), adversarial loss, and KL divergence regularization. As illustrated in Fig. 2, our model architecture emphasizes a low-frequency energy flow and enforces symmetry between the encoder and decoder. To preserve this architectural principle, we introduce a novel regularization term, denoted as ℒ 𝑊 ⁢ 𝐿 (WL loss), which ensures structural consistency by penalizing deviations from the expected energy flow:

$$\mathcal{L}_{WL} = |\hat{\mathcal{W}}^{(2)} - \mathcal{W}^{(2)}| + |\hat{\mathcal{W}}^{(3)} - \mathcal{W}^{(3)}|. \tag{2}$$

The overall loss function is defined as:

$$\mathcal{L} = \mathcal{L}_{recon} + \lambda_{adv} \mathcal{L}_{adv} + \lambda_{KL} \mathcal{L}_{KL} + \lambda_{WL} \mathcal{L}_{WL}, \tag{3}$$

where $\lambda_{adv}$, $\lambda_{KL}$, and $\lambda_{WL}$ are weighting coefficients for the corresponding loss components. Following [Esser_2021_CVPR], we adopt dynamic adversarial loss weighting to balance the relative gradient magnitudes of the adversarial and reconstruction losses:

$$\lambda_{adv} = \frac{1}{2} \left( \frac{\|\nabla_{G_L}[\mathcal{L}_{recon}]\|}{\|\nabla_{G_L}[\mathcal{L}_{adv}]\| + \delta} \right), \tag{4}$$

where $\nabla_{G_L}[\cdot]$ represents the gradient with respect to the final layer of the decoder, and $\delta = 10^{-6}$ is introduced for numerical stability.
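The weighting in Eq. (4) and the combination in Eq. (3) can be sketched in a few lines. This is a toy illustration under stated assumptions: gradients are plain Python lists standing in for the gradient of each loss with respect to the decoder's final layer, the norm is taken to be the L2 norm, and the function names and all numeric values are hypothetical.

```python
import math


def l2_norm(grad):
    """L2 norm of a flattened gradient (assumed norm; given as a plain list)."""
    return math.sqrt(sum(g * g for g in grad))


def adaptive_adv_weight(grad_recon, grad_adv, delta=1e-6):
    """Eq. (4): half the ratio of gradient norms, with delta for stability."""
    return 0.5 * l2_norm(grad_recon) / (l2_norm(grad_adv) + delta)


def total_loss(l_recon, l_adv, l_kl, l_wl, lam_adv, lam_kl, lam_wl):
    """Eq. (3): weighted sum of the four loss terms."""
    return l_recon + lam_adv * l_adv + lam_kl * l_kl + lam_wl * l_wl


# Toy gradients: the reconstruction gradient is five times larger in norm.
grad_recon = [3.0, 4.0]
grad_adv = [0.6, 0.8]
lam_adv = adaptive_adv_weight(grad_recon, grad_adv)

# lam_kl and lam_wl below are placeholder values, not the paper's settings.
loss = total_loss(1.0, 0.2, 0.5, 0.1, lam_adv, lam_kl=1e-6, lam_wl=0.1)
```

When the adversarial gradient is small relative to the reconstruction gradient, the weight grows, so both terms contribute gradients of comparable magnitude.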

Figure: Illustration of Causal Cache.

Causal Cache. We substitute regular 3D convolutions with causal 3D convolutions [yu2024languagemodelbeatsdiffusion] in WF-VAE, with $k_t - 1$ temporal padding at the start, enabling unified processing of images and videos. For efficient inference on $T$-frame videos, we extract the first frame and process the remaining frames in chunks of size $T_{chunk}$. We cache $T_{cache}(m)$ tail frames between chunks, where:

$$T_{cache}(m) = k_t + m T_{chunk} - s_t \left\lfloor \frac{m T_{chunk}}{s_t} + 1 \right\rfloor. \tag{5}$$

This method requires that $(T - k_t)$ is divisible by $s_t$ and that $(T - 1) \bmod s_t = 0$. We give an illustrative example in Fig. 2.1 (the Causal Cache illustration): with $k_t = 3$, $s_t = 1$, and $T_{chunk} = 4$, $T_{cache}(m) = 2$ frames are cached.
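Eq. (5) is easy to check numerically. The sketch below evaluates it for the example above and also shows one way the frames could be split into a leading frame plus fixed-size chunks. The helper names are hypothetical, $m$ is taken to be the chunk index (the text does not define it explicitly), and the assumption that $T - 1$ is a multiple of $T_{chunk}$ is made only to keep the example simple.

```python
def cache_frames(k_t, s_t, t_chunk, m):
    """T_cache(m) from Eq. (5); floor(x / s + 1) equals x // s + 1 for integers."""
    return k_t + m * t_chunk - s_t * ((m * t_chunk) // s_t + 1)


def split_into_chunks(num_frames, t_chunk):
    """Return (start, end) frame ranges: the first frame, then chunks of t_chunk.

    Assumes (num_frames - 1) is a multiple of t_chunk (illustrative choice).
    """
    assert (num_frames - 1) % t_chunk == 0
    chunks = [(0, 1)]
    for start in range(1, num_frames, t_chunk):
        chunks.append((start, start + t_chunk))
    return chunks


# Example from the text: kernel 3, stride 1, chunk size 4.
cached = [cache_frames(k_t=3, s_t=1, t_chunk=4, m=m) for m in range(1, 4)]

# A 17-frame video becomes one leading frame plus four chunks of four frames.
chunks = split_into_chunks(num_frames=17, t_chunk=4)
```

With stride 1 the cache size is independent of $m$: every chunk needs the last $k_t - 1$ frames of its predecessor, which is exactly the temporal padding a causal convolution would otherwise insert.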

Training Details. We utilize the AdamW optimizer [Kingma_Ba_2014, loshchilov2019decoupledweightdecayregularization] with parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$, maintaining a fixed learning rate of $1 \times 10^{-5}$. Our training process consists of three stages: (i) In the first stage, following the methodology of [chen2024od], we preprocess videos to contain 25 frames at a resolution of $256 \times 256$, with a total batch size of 8. (ii) In the second stage, we update the discriminator, increase the number of frames to 49, and halve the frames per second (FPS) to enhance motion dynamics. (iii) In the third stage, having found that a large $\lambda_{lpips}$ adversely affects video stability, we update the discriminator again and set $\lambda_{lpips}$ to 0.1. The initial stage is trained for 800,000 steps, and the subsequent stages are each trained for 200,000 steps. The training process is conducted on 8 NPUs [liao2021ascend]/GPUs. We employ a 3D discriminator and initiate GAN training from the beginning.

2.2 Joint Image-Video Skiparse Denoiser

Figure 3: Overview of the Joint Image-Video Skiparse Denoiser. The model learns the denoising process in a low-dimensional latent space, which is compressed from input videos via our Wavelet-Flow VAE. Text prompts and timesteps are injected into each Cross-DiT block layer equipped with 3D RoPE. Our Skiparse attention is applied to every layer except the first and last two layers.

2.2.1 Model Overview

As shown in Fig. 3, we compress input images or videos from pixel space to latent space for denoising training with the diffusion model. Given an input latent $x \in \mathbb{R}^{B \times C \times T \times H \times W}$, we first split the latent into small tokens with a 3D convolutional layer and flatten them into a 1D sequence, converting the latent dimension $C$ to dimension $D$. We use kernel sizes $k_t = 1$, $k_h = 2$, and $k_w = 2$, with strides matching the kernel sizes, resulting in a total of $L = \frac{THW}{k_t k_h k_w}$ tokens. We further use mT5-XXL [xue2020mt5] as the text encoder to map text prompts to a high-dimensional feature space, and we also convert the text features to dimension $D$ through a single MLP layer.
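As a quick sanity check of the token count $L$, the sketch below computes it for a given latent size and patch kernel. The function name is hypothetical, and the latent size $24 \times 64 \times 64$ is an illustrative assumption rather than a value stated in this paragraph.

```python
def num_tokens(t, h, w, k_t=1, k_h=2, k_w=2):
    """Token count L = T*H*W / (k_t*k_h*k_w) for stride-equals-kernel patching."""
    assert t % k_t == 0 and h % k_h == 0 and w % k_w == 0, "shape must tile evenly"
    return (t // k_t) * (h // k_h) * (w // k_w)


# Illustrative latent of 24 frames at 64 x 64 spatial resolution.
tokens = num_tokens(t=24, h=64, w=64)
```

Because the temporal kernel is 1, patching leaves the number of latent frames unchanged and only reduces the spatial grid by a factor of 2 per axis.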

3D RoPE. We employ 3D rotary position embedding (RoPE), which allows the model to directly compare relative differences between positions rather than relying on absolute positions. We first define the computation process of $n$D RoPE. After the "patchifying" operation, the latent $\mathbf{X} \in \mathbb{R}^{B \times L \times D}$ is divided into $n$ parts along the $D$ dimension, i.e., $\mathbf{X} = [\mathbf{X}_1, \ldots, \mathbf{X}_n]$, where $\mathbf{X}_i \in \mathbb{R}^{B \times L \times \frac{D}{n}}$, $i \in [1, \ldots, n]$, and we apply RoPE to each partitioned tensor $\mathbf{X}_i$. Denoting the RoPE operation [su2024roformer] as $\operatorname{RoPE}(\mathbf{X}_i)$, we inject the relative position encoding of the $i$-th dimension into tensor $\mathbf{X}_i$ and concatenate the processed tensors along the $D$ dimension to obtain the final result:

$$\mathbf{X}_i^{rope} = \operatorname{RoPE}(\mathbf{X}_i), \tag{6}$$

$$\mathbf{X}^{final} = \operatorname{Concat}([\mathbf{X}_1^{rope}, \ldots, \mathbf{X}_n^{rope}]), \tag{7}$$

where $\operatorname{Concat}(\cdot)$ denotes the concatenation operation and $\mathbf{X}^{final} \in \mathbb{R}^{B \times L \times D}$. When $n = 1$, this is equivalent to applying RoPE to a 1D sequence as in large language models. When $n = 2$, it can be viewed as 2D RoPE applied along the height and width directions of an image. When $n = 3$, RoPE is applied to video data by incorporating relative position encoding in both the temporal and spatial dimensions to enhance the representation of sequences.
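The partition-and-concatenate scheme of Eqs. (6) and (7) can be sketched for a single token vector as follows. This is an illustrative sketch, not the project's implementation: the interleaved pairing of channels and the frequency base of 10000 are assumptions, and the function names are hypothetical. The final lines check the defining property of RoPE, namely that the query-key score depends only on the relative offset along each axis.

```python
import math


def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x by angles proportional to pos."""
    d = len(x)
    assert d % 2 == 0, "channel count must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # frequency of this channel pair
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def rope_nd(x, coords, base=10000.0):
    """Split x into len(coords) equal parts and apply rope_1d per axis (Eqs. 6-7)."""
    n = len(coords)
    assert len(x) % n == 0
    part = len(x) // n
    out = []
    for i, pos in enumerate(coords):
        out += rope_1d(x[i * part:(i + 1) * part], pos, base)
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [0.1 * (i + 1) for i in range(12)]
k = [0.05 * (12 - i) for i in range(12)]

# Positions are (t, h, w). Shifting both tokens by the same offset on every
# axis leaves the attention score unchanged.
score_a = dot(rope_nd(q, (1, 2, 3)), rope_nd(k, (4, 6, 5)))
score_b = dot(rope_nd(q, (11, 12, 13)), rope_nd(k, (14, 16, 15)))
```

With `coords` of length 1 this reduces to the usual 1D RoPE, and with length 2 to the image case described above.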

Block Design. Inspired by large language model architectures [dubey2024llama, yang2024qwen2, jiang2023mistral, young2024yi], we adopt a pre-norm transformer block structure primarily comprising self-attention, cross-attention, and a feedforward network. Following [peebles2023scalable, chen2023pixartalpha], we map timesteps to two sets of scale, shift, and gate parameters through adaLN-Zero [peebles2023scalable]. We then inject such two sets of values to self-attention and the FFN separately, and 3D RoPE is employed in self-attention layers. In version 1.2, we start to introduce Full 3D Attention instead of 2+1D Attention for significantly enhancing video motion smoothness and visual quality. However, the quadratic complexity of Full 3D Attention requires substantial computational resources, thus we propose a novel sparse attention mechanism. To ensure direct 3D interaction, we retain Full 3D Attention in the first and last two layers.

2.2.2Skiparse Attention

The 2+1D Attention widely used by earlier video generation methods computes frame interactions only along the temporal dimension, limiting video generation performance both in theory and in practice. In contrast, Full 3D Attention is a global computation that allows content at arbitrary spatial and temporal positions to interact, an approach that aligns well with real-world physics. However, Full 3D Attention is time-consuming and inefficient: visual information often contains considerable redundancy, so establishing attention across all spatiotemporal tokens is unnecessary. An ideal spatiotemporal modeling approach should employ attention that minimizes the overhead from redundant visual information while capturing the complexities of the dynamic physical world. Reducing redundancy requires avoiding connections among all tokens, yet global attention remains essential for modeling complex physical interactions.

Figure 4: Calculation process of Skiparse Attention, with sparse ratio $k = 2$ as an example. In our Skiparse Attention operation, we alternately perform the Single Skip and the Group Skip operations, reducing the sequence length to $1/k$ of the original size in each operation.

Figure 5: The interacted sequence scope of different attention mechanisms. The attention mechanisms mainly differ in the number and position of the tokens selected during attention computation.

To balance computational efficiency and spatiotemporal modeling ability, we propose a Skiparse (Skip-Sparse) Attention mechanism. A denoiser with Skiparse Attention only replaces the original attention layers with two alternating sparse attention operations, named Single Skip and Group Skip, in the Transformer blocks. Given a sparse ratio $k$, the sequence length in the attention operation is reduced to $\frac{1}{k}$ of the original and the batch size increases $k$-fold, lowering the theoretical complexity of self-attention to $\frac{1}{k}$, while the cross-attention complexity remains unchanged.

The calculation process of the two skip operations is shown in Fig. 4. In the Single Skip operation, the elements located at positions $[0, k, 2k, 3k, \ldots]$, $[1, k+1, 2k+1, 3k+1, \ldots]$, …, $[k-1, 2k-1, 3k-1, \ldots]$ are each bundled into a sequence, i.e., each token performs attention with tokens spaced $k-1$ apart.

In the Group Skip operation, the elements at positions $[(0, 1, \ldots, k-1), (k^2, k^2+1, \ldots, k^2+k-1), (2k^2, 2k^2+1, \ldots, 2k^2+k-1), \ldots]$, $[(k, k+1, \ldots, 2k-1), (k^2+k, k^2+k+1, \ldots, k^2+2k-1), (2k^2+k, 2k^2+k+1, \ldots, 2k^2+2k-1), \ldots]$, …, $[(k^2-k, k^2-k+1, \ldots, k^2-1), (2k^2-k, 2k^2-k+1, \ldots, 2k^2-1), (3k^2-k, 3k^2-k+1, \ldots, 3k^2-1), \ldots]$ are each bundled into a sequence. Concretely, we first group adjacent tokens into segments of length $k$, then bundle each group with the other groups that are spaced $k-1$ groups apart into one sequence. For instance, in $[(0, 1, \ldots, k-1), (k^2, k^2+1, \ldots, k^2+k-1), (2k^2, 2k^2+1, \ldots, 2k^2+k-1), \ldots]$, each set of indices in parentheses represents a group, and each group is connected with the next group offset by $k-1$ groups to form one sequence. The main difference between the Group Skip operation and traditional Skip + Window Attention is that our operation involves not only grouping but also skipping, which previous attempts ignore: Window Attention only groups adjacent tokens, without connecting skipped groups into one sequence. The distinctions among these attention methods are illustrated in Fig. 5, with dark tokens representing the tokens involved in one attention calculation.
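The two index patterns described above can be written down directly. The sketch below is a minimal illustration on token indices only (no attention is computed); the function names are hypothetical, and `group_skip` assumes the ideal case where the sequence length is a multiple of $k^2$.

```python
def single_skip(seq_len, k):
    """k bundles; bundle r holds positions r, r + k, r + 2k, ..."""
    return [list(range(r, seq_len, k)) for r in range(k)]


def group_skip(seq_len, k):
    """k bundles; bundle j holds the groups of k adjacent tokens that start
    at j*k, j*k + k^2, j*k + 2k^2, ... (assumes seq_len % k^2 == 0)."""
    assert seq_len % (k * k) == 0, "ideal case only"
    bundles = []
    for j in range(k):
        idx = []
        for start in range(j * k, seq_len, k * k):
            idx.extend(range(start, start + k))
        bundles.append(idx)
    return bundles


# Sparse ratio k = 2 on a 16-token sequence, as in the k = 2 illustration.
single = single_skip(16, 2)
group = group_skip(16, 2)
```

Each bundle has length `seq_len / k` and there are `k` bundles, which matches the statement that the sequence length shrinks to $1/k$ while the batch size grows $k$-fold.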

We further note that the attention in 2+1D DiT corresponds to $k = HW$ (the skip operation in Group Skip has no effect when $T \ll HW$), while Full 3D DiT corresponds to $k = 1$. In Skiparse Attention, $k$ is typically chosen to be close to 1, yet far smaller than $HW$, so that Skiparse Attention approaches the effectiveness of Full 3D Attention while decreasing the computation cost.

Additionally, we propose the concept of Average Attention Distance ($\text{AD}_{avg}$) to quantify how closely a given attention mechanism aligns with Full 3D Attention. It is defined as follows: if at least $m$ attention calculations are required to establish a connection between two tokens A and B, the attention distance A → B is $m$ (the attention distance between a token and itself is zero). The $\text{AD}_{avg}$ of an attention mechanism is then the mean of the attention distances over all token pairs in the input sequence, and it reflects how efficiently the corresponding attention method models relations among all tokens. To calculate $\text{AD}_{avg}$ for a given attention method, we first identify which tokens have an attention distance of 1; the tokens with an attention distance of 2 can then be determined as the remainder. The $\text{AD}_{avg}$ values and their derivations are as follows:

For Full 3D Attention, each token can interact with any other token in one attention calculation, resulting in $\text{AD}_{avg} = 1$.

For 2+1D Attention, any two tokens are connected with an attention distance of 1 or 2. In the $2N$ Block, attention operates over the $(H, W)$ dimensions, where tokens within this region have an attention distance of 1. In the $2N+1$ Block, attention operates along the $T$ dimension, and the attention distance is also 1 for these tokens. The total number of tokens with an attention distance of 1 is $(HW + T - 1) - 1 = HW + T - 2$. Therefore, $\text{AD}_{avg}$ of 2+1D Attention is:

$$\text{AD}_{avg} = \frac{1}{THW}\Big[1 \times 0 + (HW + T - 2) \times 1 + \big(THW - (HW + T - 1)\big) \times 2\Big] = 2 - \left(\frac{1}{T} + \frac{1}{HW}\right). \tag{8}$$

For Skip + Window Attention, aside from the token itself, there are $\frac{THW}{k} - 1$ tokens with an attention distance of 1 in the $2N$ Block, and $k - 1$ tokens with an attention distance of 1 in the $2N+1$ Block. Thus, the total number of tokens with an attention distance of 1 is $\frac{THW}{k} + k - 2$. Therefore, $\text{AD}_{avg}$ of Skip + Window Attention is:

$$\text{AD}_{avg} = \frac{1}{THW}\left[1 \times 0 + \left(\frac{THW}{k} + k - 2\right) \times 1 + \left(THW - \left(\frac{THW}{k} + k - 1\right)\right) \times 2\right] = 2 - \left(\frac{1}{k} + \frac{k}{THW}\right). \tag{9}$$

In Skiparse Attention, aside from the token itself, $\frac{THW}{k} - 1$ tokens have an attention distance of 1 in the $2N$ Block, and $\frac{THW}{k} - 1$ tokens have an attention distance of 1 in the $2N+1$ Block. Notably, $\frac{THW}{k^2} - 1$ tokens can establish an attention distance of 1 in both blocks and should not be counted twice. Therefore, $\text{AD}_{avg}$ in Skiparse Attention is:

$$\text{AD}_{avg} = \frac{1}{THW}\left[1 \times 0 + \left(\frac{2THW}{k} - 2 - \left(\frac{THW}{k^2} - 1\right)\right) \times 1 + \left(THW - \left(\frac{2THW}{k} - \frac{THW}{k^2}\right)\right) \times 2\right] = 2 - \frac{2}{k} + \frac{1}{k^2} - \frac{1}{THW} \approx 2 - \frac{2}{k} + \frac{1}{k^2}. \tag{10}$$

We note that the actual sequence length is $k \lceil \frac{THW}{k^2} \rceil$ rather than $\frac{THW}{k}$ in the Group Skip of the $2N+1$ Block. Our calculation assumes the ideal case where $k \ll THW$ and $THW \bmod k = 0$, yielding $k \lceil \frac{THW}{k^2} \rceil = k \cdot \frac{THW}{k^2} = \frac{THW}{k}$. In practical applications, excessively large $k$ values are typically avoided, making this derivation a reasonably accurate approximation for general usage.
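The closed form in Eq. (10) can be cross-checked by brute force on a short sequence. The sketch below rebuilds the Single Skip and Group Skip bundles, marks every pair of tokens that share a bundle as distance 1, verifies that all remaining pairs are reachable in two hops, and averages the distances. It is a verification sketch under the ideal-case assumption that the sequence length is a multiple of $k^2$; all function names are hypothetical.

```python
def single_skip(seq_len, k):
    """Bundle r holds positions r, r + k, r + 2k, ..."""
    return [list(range(r, seq_len, k)) for r in range(k)]


def group_skip(seq_len, k):
    """Bundle j holds groups of k adjacent tokens spaced k^2 apart."""
    assert seq_len % (k * k) == 0, "ideal case only"
    bundles = []
    for j in range(k):
        idx = []
        for start in range(j * k, seq_len, k * k):
            idx.extend(range(start, start + k))
        bundles.append(idx)
    return bundles


def ad_avg_bruteforce(seq_len, k):
    """Average attention distance over all ordered token pairs (self = 0)."""
    one_hop = [set() for _ in range(seq_len)]
    for bundles in (single_skip(seq_len, k), group_skip(seq_len, k)):
        for bundle in bundles:
            members = set(bundle)
            for token in bundle:
                one_hop[token] |= members  # includes the token itself
    total = 0
    for token in range(seq_len):
        two_hop = set()
        for mid in one_hop[token]:
            two_hop |= one_hop[mid]
        assert len(two_hop) == seq_len  # every token is within two hops
        d1 = len(one_hop[token]) - 1    # exclude the token itself
        total += d1 * 1 + (seq_len - 1 - d1) * 2
    return total / (seq_len * seq_len)


def ad_avg_closed_form(seq_len, k):
    """Exact form of Eq. (10), before dropping the 1/THW term."""
    return 2 - 2 / k + 1 / k ** 2 - 1 / seq_len


small = ad_avg_bruteforce(16, 2)
medium = ad_avg_bruteforce(64, 4)
```

On these toy sizes the enumeration agrees with the exact expression, including the $\frac{1}{THW}$ term that the approximation drops.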

For the commonly used resolution of $93 \times 512 \times 512$, using a causal VAE with a $4 \times 8 \times 8$ compression rate and a convolutional layer with a $1 \times 2 \times 2$ kernel for patch embedding, we obtain a latent shape of $24 \times 32 \times 32$ as the input sequence for attention calculations. We summarize the characteristics of these attention types in Tab. 1, and list $\text{AD}_{avg}$ for the different attention methods at latent shape $24 \times 32 \times 32$ in Tab. 2. Considering the balance between computational load and Average Attention Distance, we use Skiparse Attention with $k = 4$ in our implementations.
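The latent shape and the average attention distances quoted for it follow from the formulas above by direct arithmetic. The sketch below reproduces that arithmetic; the function names are hypothetical, and the mapping of $F$ input frames to $(F - 1)/4 + 1$ latent frames is an assumption about the causal VAE that is consistent with 93 frames yielding 24.

```python
def latent_shape(frames, height, width, vae=(4, 8, 8), patch=(1, 2, 2)):
    """Token grid after causal-VAE compression and patch embedding.

    Assumes the causal VAE keeps the first frame and compresses the rest,
    i.e. frames -> (frames - 1) // vae_t + 1.
    """
    t = ((frames - 1) // vae[0] + 1) // patch[0]
    h = height // vae[1] // patch[1]
    w = width // vae[2] // patch[2]
    return t, h, w


def ad_2plus1d(t, hw):
    """Eq. (8)."""
    return 2 - (1 / t + 1 / hw)


def ad_skip_window(n, k):
    """Eq. (9), with n = T*H*W."""
    return 2 - (1 / k + k / n)


def ad_skiparse(n, k):
    """Eq. (10), exact form including the 1/n term."""
    return 2 - 2 / k + 1 / k ** 2 - 1 / n


T, H, W = latent_shape(93, 512, 512)
N = T * H * W

table = {"Full 3D": 1.0, "2+1D": ad_2plus1d(T, H * W)}
for k in (2, 4, 8):
    table[f"Skip + Window (k={k})"] = ad_skip_window(N, k)
    table[f"Skiparse (k={k})"] = ad_skiparse(N, k)

for name, value in table.items():
    print(f"{name:24s} {value:.4f}")
```

At this sequence length the $\frac{1}{THW}$ correction is about $4 \times 10^{-5}$, so the approximate form $2 - \frac{2}{k} + \frac{1}{k^2}$ is accurate to the precision reported in Tab. 2.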

Table 1: Comparison of the different attention mechanisms. Across multiple comparison metrics, Skiparse Attention is closer to Full 3D Attention, giving it the best spatiotemporal modeling capability apart from Full 3D Attention.

| Attention Mechanism | Speed | Modeling Capability | Global Attention | Block Computation | Average Attention Distance |
|---|---|---|---|---|---|
| Full 3D Attention | Slow | Strong | All blocks | Equal | $1$ |
| 2+1D Attention | Fast | Weak | No blocks | Not equal | $2 - (\frac{1}{T} + \frac{1}{HW})$ |
| Skip + Window Attention | Middle | Weak | Half of the blocks | Not equal | $2 - (\frac{1}{k} + \frac{k}{THW})$ |
| Skiparse Attention | Middle | Strong | All blocks | Equal | $2 - \frac{2}{k} + \frac{1}{k^2}$, $1 < k \ll THW$ |

Table 2: The average attention distance $\text{AD}_{avg}$ of different attention mechanisms. Results are calculated for a latent shape of $24 \times 32 \times 32$.

| Attention Mechanism | $\text{AD}_{avg}$ |
|---|---|
| Full 3D Attention | 1.000 |
| 2+1D Attention | 1.957 |
| Skip + Window Attention ($k = 2$) | 1.500 |
| Skip + Window Attention ($k = 4$) | 1.750 |
| Skip + Window Attention ($k = 8$) | 1.875 |
| Skiparse Attention ($k = 2$) | 1.250 |
| Skiparse Attention ($k = 4$) | 1.563 |
| Skiparse Attention ($k = 8$) | 1.766 |

2.2.3 Training Details

Similar to previous works [opensora, chen2024pixart, blattmann2023stable], we use a multi-stage approach for model training. Starting with training an image model, our joint denoiser learns a rich understanding of static visual features, as many effective visual patterns in images also apply to videos. Benefiting from the 3D DiT architecture, all parameters transfer seamlessly from images to videos. Thus, we adopt a progressive training strategy from images to videos. For all training stages, we use v-prediction diffusion loss with zero terminal SNR [lin2024common]. We use the min-snr weighting strategy [hang2023efficient] with $\gamma = 5.0$ to accelerate convergence. The text encoder has a maximum input length of 512. We use the AdamW [Kingma_Ba_2014, loshchilov2019decoupledweightdecayregularization] optimizer with parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$. Details of the datasets used in the training stages are given in Sec. 4.

Text-to-Image Pretraining. The objective of this stage is to learn a visual prior that enables fast convergence when training on videos, reducing dependency on large-scale video datasets. Since the weights of Full 3D Attention can efficiently transfer to Skiparse Attention, we first train a Full 3D Attention model on $256 \times 256$ images to generate text-conditioned images, for approximately 150k steps. We then inherit the model weights and replace Full 3D Attention with Skiparse Attention, tuning the 3D dense attention model into a sparse attention model. The tuning process involves around 100k steps, a batch size of 1024, and a learning rate of 2e-5. The image datasets include SAM, Anytext, and Human-images.

Text-to-Image&Video Pretraining. We jointly train on images and videos, with a maximum shape of 93 × 640 × 640 . The pretraining process includes approximately 200k steps, a batch size of 1024, and a learning rate of 2e-5. Image data consists almost entirely of SAM from version 1.2.0, and the leveraged video dataset is the original Panda70M.

Text-to-Video Fine-tuning. The model nearly converges around 100k steps, with no substantial gains observed by 200k steps. Following the procedures in Sec. 4, we refine the data by cleaning and re-captioning. Fine-tuning is conducted with the filtered Panda70M and additional high-quality data at a fixed resolution of 93 × 352 × 640 . This process runs for 30k steps with a learning rate of 1e-5, utilizing 256 NPUs/GPUs with a total batch size of 1024.

Figure 6: Overview of our Image Condition Controller. Our controller unifies multiple image-conditioned tasks, including image-to-video, video transition, and video continuation, in one framework by changing the given masks.

Figure 7: Overview of our Structure Condition Controller. The Structure Controller contains two light components: an encoder that extracts a high-level representation from the structural signals, and a projector that transforms this representation into injection features. Finally, we directly add the obtained injection features to the pre-trained model for structure control.

2.3 Conditional Controllers

2.3.1 Image Condition Controller

Inspired by Stable Diffusion Inpainting [stable_diffusion], we regard the image conditional tasks as an inpainting task in the temporal dimension for a more flexible training paradigm.

The image condition model is initialized from our text-to-video weights. As shown in Fig. 6, it takes two additional inputs, the given mask and the masked video, which are concatenated with the latent noise and then fed into the denoiser. For the given mask, instead of encoding it with the VAE, we adopt a "reshape" operation to align the latent dimensions, because the temporal down-sampling in the VAE would damage the control accuracy of the masks. For the masked video, we multiply the original video by the given mask and then feed the result into the VAE for encoding.

Unlike previous works based on 2+1D Attention, which inject semantic features of images (usually extracted via CLIP [clip]) into the UNet or DiT to enhance cross-frame stability [blattmann2023stable, dynamicrafter, easyanimate], we simply alter the input channels of the DiT without incorporating semantic features for control. We observe that the various semantic injection methods do not noticeably improve the generated results and instead limit the range of motion, so we discard the image semantic injection module in our experiments.

Figure: Different types of masks for image-conditioned generation. Black masks indicate corresponding frames are retained, while white masks indicate frames are masked.

Training Details. For the training configuration, we adopt the same settings as the text-to-video model, including v-prediction, zero terminal SNR, and the min-snr weighting strategy, with parameters consistent with the text-to-video model. We also use the AdamW optimizer with a constant learning rate of 1e-5 and utilize 256 NPUs with a batch size fixed at 512.

Thanks to the flexibility of different mask types in our inpainting framework, we design a progressive training strategy that gradually increases the difficulty of the training tasks, as shown in Fig. 2.3.1. This strategy leads to smoother training curves and improves motion consistency. The masks used during training are set as follows: (1) Clear: Retain all frames. (2) T2V: Discard all frames. (3) I2V: Retain only the first frame and discard the rest. (4) Transition: Retain only the first and last frames and discard the rest. (5) Continuation: Retain the first $n$ frames and discard the rest. (6) Random: Retain $n$ randomly selected frames and discard the rest. Concretely, our progressive training strategy includes two stages. In Stage 1, we train on multiple simple tasks at a low resolution. In Stage 2, we train the image-to-video and video transition tasks at a higher resolution.
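The six mask types can be expressed as frame-level 0/1 lists. The sketch below is a toy illustration: the function names are hypothetical, frames are stand-in scalars rather than images, and the polarity (1 = retained, 0 = discarded) is an assumption chosen so that multiplying the video by the mask zeroes the discarded frames.

```python
import random


def make_mask(task, num_frames, n=1, rng=None):
    """Frame-level mask: 1 = frame retained as condition, 0 = frame discarded."""
    rng = rng or random.Random(0)
    if task == "clear":
        keep = set(range(num_frames))
    elif task == "t2v":
        keep = set()
    elif task == "i2v":
        keep = {0}
    elif task == "transition":
        keep = {0, num_frames - 1}
    elif task == "continuation":
        keep = set(range(n))
    elif task == "random":
        keep = set(rng.sample(range(num_frames), n))
    else:
        raise ValueError(f"unknown task: {task}")
    return [1 if i in keep else 0 for i in range(num_frames)]


def apply_mask(frames, mask):
    """Masked video: discarded frames are zeroed before VAE encoding."""
    return [f * m for f, m in zip(frames, mask)]


frames = [0.5, 0.6, 0.7, 0.8, 0.9]  # one scalar per frame, for illustration
i2v = make_mask("i2v", 5)
transition = make_mask("transition", 5)
continuation = make_mask("continuation", 5, n=2)
rand = make_mask("random", 5, n=3)
masked_video = apply_mask(frames, transition)
```

Because every task is just a different mask, switching the task mix during the progressive schedule only changes how masks are sampled, not the model inputs.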

Stage 1: Any resolution and duration within $93 \times 102400$ ($320 \times 320$), using unfiltered motion and aesthetic low-quality data. The task ratios at different steps are as follows:

1. T2V 10%, Continuation 40%, Random 40%, Clear 10%. Ensure that at least 50% of the frames are retained during continuation and random mask, training with 4 million samples.

2. T2V 10%, Continuation 40%, Random 40%, Clear 10%. Ensure that at least 25% of the frames are retained during continuation and random mask, training with 4 million samples.

3. T2V 10%, Continuation 40%, Random 40%, Clear 10%. Ensure that at least 12.5% of the frames are retained during continuation and random mask, training with 4 million samples.

4. T2V 10%, Continuation 25%, Random 60%, Clear 5%. Ensure that at least 12.5% of the frames are retained during continuation and random mask, training with 4 million samples.

5. T2V 10%, Continuation 25%, Random 60%, Clear 5%, training with 8 million samples.

6. T2V 10%, Continuation 10%, Random 20%, I2V 40%, Transition 20%, training with 16 million samples.

7. T2V 5%, Continuation 5%, Random 10%, I2V 40%, Transition 40%, training with 10 million samples.

Stage 2: Any resolution and duration within $93 \times 236544$ (e.g., $480 \times 480$, $640 \times 352$, $352 \times 640$), using filtered motion and aesthetic high-quality data. The task ratios are T2V 5%, Continuation 5%, Random 10%, I2V 40%, Transition 40%, training with 15 million samples.

After completing the two-stage training, we follow the approach of [yang2024cogvideox] and add slight Gaussian noise to the conditional images during fine-tuning to enhance generalization, using 5 million filtered motion and aesthetic high-quality samples.

2.3.2 Structure Condition Controller

When imposing structural control on our retained text-to-image model, an intuitive idea is to use previous control methods [controlnet, t2iadapter, controlnet_plus_plus, sparsectrl] designed for U-Net-based base models. However, most of these methods are based on ControlNet [controlnet], which copies half of the base model to process the control signals and increases hardware consumption by nearly 50%. This additional consumption is immense, as the original expense of our Open-Sora Plan base model is already extremely high. Although some works [t2iadapter, controlnext] try to replace the heavy copy of the base model with a lighter network at the sacrifice of controllability, they would probably lead to poor alignment between the input structural signals and the generated video when used for our base model.

To add structural control to our base model more efficiently, we propose a novel Structure Condition Controller, as shown in Fig. 7. Specifically, we suppose the denoiser of our base model contains $M$ transformer blocks. For the $j$-th ($1 \le j \le M$) transformer block $\mathcal{T}_j$ in the base model, its output is a series of tokens $\boldsymbol{X}_j$, which can be expressed as:

$$\boldsymbol{X}_j = \mathcal{T}_j(\boldsymbol{X}_{j-1}). \tag{11}$$

Given a structural signal $\boldsymbol{C}_S$, the encoder $\mathcal{E}$ extracts the high-level representation $\boldsymbol{R}$ from $\boldsymbol{C}_S$:

$$\boldsymbol{R} = \mathcal{E}(\boldsymbol{C}_S). \tag{12}$$

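Then, the projector $\mathcal{P}$, containing $M$ transformations with the same process, transforms $\boldsymbol{R}$ into the injection feature $\boldsymbol{F}$, which includes $M$ elements. This can be expressed as:

$$\mathcal{P} = [\mathcal{P}_1, \mathcal{P}_2, \dots, \mathcal{P}_M], \tag{13}$$

$$\boldsymbol{F} = [\boldsymbol{F}_1, \boldsymbol{F}_2, \dots, \boldsymbol{F}_M], \tag{14}$$

$$\boldsymbol{F}_j = \mathcal{P}_j(\boldsymbol{R}). \tag{15}$$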

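Here $\mathcal{P}_j$ denotes the $j$-th transformation of $\mathcal{P}$, which transforms $\boldsymbol{R}$ into $\boldsymbol{F}_j$, the $j$-th element of $\boldsymbol{F}$. To impose structural control on the base model, we can directly add $\boldsymbol{F}_j$ to $\boldsymbol{X}_j$:

$$\boldsymbol{X}_j = \boldsymbol{X}_j + \boldsymbol{F}_j. \tag{16}$$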

To satisfy the above equation, we must ensure that the shape of $\boldsymbol{F}_j$ equals that of $\boldsymbol{X}_j$. To achieve this, we design the encoder $\mathcal{E}$ and projector $\mathcal{P}$ as follows. In the encoder $\mathcal{E}$, we first downsample $\boldsymbol{C}_S$ with a tiny 3D convolution-based network so that its shape matches that of $\boldsymbol{Z}_t$. Then, we flatten $\boldsymbol{C}_S$ into tokens with the same shape as $\boldsymbol{X}_j$ ($1 \le j \le M$). After that, these tokens are processed by $K$ transformer blocks, which maintain the token shape, to obtain $\boldsymbol{R}$. For the projector $\mathcal{P}$, we only need to ensure that $\mathcal{P}_j$ does not change the token shape of $\boldsymbol{R}$. Thus, we design $\mathcal{P}_j$ as a token-wise transformation with the same input and output shape, such as a linear FC layer or a two-layer MLP, which is efficient and maintains the token shape.
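The injection in Eqs. (11), (15), and (16) can be illustrated with a small pure-Python sketch. It assumes the representation 𝑹 is already available (the encoder is not modeled), uses a single linear layer per projector, and replaces the transformer blocks with a toy function; all names here are hypothetical.

```python
def token_wise_linear(tokens, weight, bias):
    """Apply the same affine map to every token, so the (L, D) shape is preserved."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]

def run_with_structure_control(x0, blocks, r, projectors):
    """Compute X_j = T_j(X_{j-1}) + F_j with F_j = P_j(R) for j = 1..M."""
    assert len(blocks) == len(projectors)
    x = x0
    for block, (weight, bias) in zip(blocks, projectors):
        x = block(x)                              # Eq. (11): base block output
        f = token_wise_linear(r, weight, bias)    # Eq. (15): injection feature
        x = [[a + b for a, b in zip(xt, ft)] for xt, ft in zip(x, f)]  # Eq. (16)
    return x

def toy_block(tokens):
    """Stand-in for a transformer block (assumption): doubles every value."""
    return [[2.0 * v for v in tok] for tok in tokens]

identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])   # projector that passes R through
zero = ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])       # projector that injects nothing

x0 = [[1.0, 1.0] for _ in range(3)]   # 3 tokens of width 2
r = [[0.5, 0.0] for _ in range(3)]    # representation with the same token shape

controlled = run_with_structure_control(x0, [toy_block, toy_block], r, [identity, identity])
uncontrolled = run_with_structure_control(x0, [toy_block, toy_block], r, [zero, zero])
```

Because each projector is token-wise, the added cost grows with the number of blocks 𝑀 rather than with a copy of the base model.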

Training Details. We utilize the Panda70M dataset to train our Structure Controller. Given a video clip, we use the specified signal extractors to extract the corresponding structural control signals. Specifically, we extract canny edges, depth maps, and sketches using the Canny detector [canny], MiDaS [midas], and PiDiNet [pidinet], respectively. We train our Structure Controller for 20k steps on 8 NPUs/GPUs, with a total batch size of 16 and a learning rate of 4e-6.

3 Assistant Strategies

3.1 Min-Max Token Strategy

To achieve efficient processing on hardware, deep neural networks are typically trained with batched inputs, meaning the shape of the training data is fixed. Traditional methods adopt two approaches including resizing images or padding images to a fixed size. However, both approaches have drawbacks, e.g., the former loses useful information, while the latter has low computational efficiency. Generally, there are three methods for training with variable token counts: Patch n’ Pack [dehghani2024patch, yang2024cogvideox], Bucket Sampler [chen2023pixartalpha, chen2024pixart, opensora], and Pad-Mask [Lu2024FiT, wang2024fitv2].

Patch n’ Pack. By packing multiple samples, this method addresses the fixed sequence length limitation. Patch n’ Pack defines a new maximum length, and tokens from multiple data instances are packed into this new data. As a result, the original data is preserved while enabling training with arbitrary resolutions. However, this method introduces significant intrusion into the model code, making it difficult to adapt in fields where the model architecture is not yet stable.

Bucket Sampler. This method packs data of different resolutions into buckets and samples batches from the buckets to ensure all data in a batch have the same resolution. It incurs minimal intrusion into the model code, primarily requiring modifications to the data sampling strategy.

Pad-Mask. This method sets a maximum resolution, pads all data to this resolution, and generates a corresponding mask to exclude loss from the masked areas. While conceptually simple, it has low computational efficiency.

We believe current video generation models are still in an exploratory phase. Patch n’ Pack incurs significant intrusion into the model code, leading to unnecessary development costs. Pad-Mask has low computational efficiency, which wastes resources in dense computations like video. The bucket strategy, while requiring no changes to the model code, leads to greater loss oscillation as token count variation increases (with more resolution types), indicating higher training instability. Given a maximum token count $m$, a resolution stride $s$, and a set of possible resolution ratios $\mathcal{R} = \{(r_1^h, r_1^w), (r_2^h, r_2^w), \dots, (r_n^h, r_n^w)\}$, we propose the Min-Max Token strategy to tackle the issues mentioned above. Note that $s = 8 \times 2$ is the product of the spatial downsampling rate of the VAE and the convolution stride of the denoiser, and that five aspect ratios are common in practice: 1:1, 3:4, 4:3, 9:16, and 16:9. For each ratio $(r_i^h, r_i^w)$ in $\mathcal{R}$, $r_i^h$ and $r_i^w$ are required to be coprime positive integers. The height $h$ and width $w$ are defined as $h = r_i^h \cdot k \cdot s$ and $w = r_i^w \cdot k \cdot s$, where $k$ is the scaling factor to be determined. The total token count $n$ satisfies the constraint $n = h \cdot w \le m$. Substituting the expressions for $h$ and $w$, we get:

$$n_i = (r_i^h \cdot k \cdot s) \cdot (r_i^w \cdot k \cdot s) = r_i^h \cdot r_i^w \cdot k^2 \cdot s^2, \tag{17}$$

so the constraint becomes:

$$r_i^h \cdot r_i^w \cdot k^2 \cdot s^2 \le m. \tag{18}$$

Taking the square root of both sides and rounding down so that $k$ is an integer, we obtain the upper bound for $k$:

$$k_i = \left\lfloor \sqrt{\frac{m}{r_i^h \cdot r_i^w \cdot s^2}} \right\rfloor. \tag{19}$$

The minimum token count $n$ is then expressed as:

$$n = \min\left(\left\{ r_i^h \cdot r_i^w \cdot k_i^2 \cdot s^2 \mid (r_i^h, r_i^w) \in \mathcal{R} \right\}\right). \tag{20}$$

For example, the max token $m$ is typically set to a perfect square, such as 65536 (256 × 256), as it reliably supports a 1:1 aspect ratio. Given this, we configure $s = 16$ and aspect ratios of 3:4 and 9:16. The resulting min token $n$ is 36864 (144 × 256).
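The worked example above can be reproduced with a few lines of Python. This is a sketch of Eqs. (19) and (20) only; the function names are hypothetical, and the integer square root is used so that the floor is computed exactly.

```python
import math

def max_scale(m, s, ratio):
    """Eq. (19): largest integer k with r_h * r_w * k^2 * s^2 <= m."""
    r_h, r_w = ratio
    return math.isqrt(m // (r_h * r_w * s * s))

def resolutions(m, s, ratios):
    """Map each aspect ratio to its largest admissible (height, width)."""
    result = {}
    for r_h, r_w in ratios:
        k = max_scale(m, s, (r_h, r_w))
        result[(r_h, r_w)] = (r_h * k * s, r_w * k * s)
    return result

def min_token(m, s, ratios):
    """Eq. (20): smallest token count over all aspect ratios."""
    return min(h * w for h, w in resolutions(m, s, ratios).values())

sizes = resolutions(65536, 16, [(1, 1), (3, 4), (9, 16)])
n_min = min_token(65536, 16, [(3, 4), (9, 16)])
```

With these settings the 9:16 ratio gives $k = 1$ and a 144 × 256 frame, which is the ratio that determines the minimum of 36864 tokens.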

As discussed above, we implement the Min-Max Token Training combined with the Bucket Sampler using a custom data sampler to maintain a consistent token count per global batch, though token counts vary across global batches. This approach allows NPUs/GPUs to maintain nearly identical compute times, reducing synchronization overhead. The method fully decouples data sampling code from model code, providing a plug-and-play sampling strategy for multi-resolution, multi-frame data.
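A bucket-style sampler that keeps the token count constant within each global batch might look like the sketch below. It is an illustration under assumptions, not the project's sampler: samples are described by a hypothetical `(frames, height, width)` tuple, and incomplete tail batches are simply dropped.

```python
import random
from collections import defaultdict

def bucket_batches(shapes, global_batch_size, seed=0):
    """Group sample indices by shape and return global batches of identical shape.

    shapes: one (frames, height, width) tuple per sample.
    """
    buckets = defaultdict(list)
    for idx, shape in enumerate(shapes):
        buckets[shape].append(idx)

    rng = random.Random(seed)
    batches = []
    for indices in buckets.values():
        rng.shuffle(indices)
        # Drop the incomplete tail so every device receives the same workload.
        last_start = len(indices) - global_batch_size
        for start in range(0, last_start + 1, global_batch_size):
            batches.append(indices[start:start + global_batch_size])
    # Shuffle across buckets so token counts vary between global batches.
    rng.shuffle(batches)
    return batches

shapes = [(93, 256, 256)] * 5 + [(93, 144, 256)] * 4
batches = bucket_batches(shapes, global_batch_size=2)
```

Only the data-side code changes here, which matches the stated goal of keeping sampling decoupled from the model code.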

3.2 Adaptive Gradient Clipping Strategy

Figure: Plot of spikes in training loss. We observe loss spikes during training that could not be reproduced with a fixed seed.

In distributed model training, we often observe loss spikes, as shown in Fig. 3.2, that significantly degrade output quality without causing NaN errors. Unlike typical NaN errors that disrupt training, these spikes temporarily increase loss values before returning to normal levels; they occur sporadically and adversely impact model performance. These spikes arise from various issues, including abnormal outputs from the VAE encoder, desynchronization in multi-node communication, or outliers in training data leading to large gradient norms.

Figure 8: Logging abnormal iterations during training. We resume training at step 75k and display logs from step 75k to 76k, noting an anomaly around step 75.6k. (a) Diffusion model loss during training. (b) Abnormal local batches discarded per step. (c) Gradient norm upper bound plotted based on a 3-sigma criterion. (d) Maximum gradient norm among all local batches. (e) Variance of the maximum gradient norm. Note that most steps involve values close to 0. (f) Maximum value of all processed gradient norms. (g) EMA of the maximum gradient norm. (h) EMA of the variance of the maximum gradient norm.

We attempt many methods, including applying gradient clipping, adjusting the 𝛽 2 in the optimizer, and reducing the learning rate, but none of these approaches resolves the issue, which appears randomly and cannot be reproduced even with a fixed seed. Playground v3 [liu2024playground] encounters the same issue and handles it by discarding an iteration if the gradient norm exceeds a fixed threshold. However, fixed thresholds may fail to adapt to decreasing gradient norms as training progresses. Therefore, we introduce an adaptive thresholding mechanism that leverages Exponential Moving Averages (EMA) for effective anomaly detection. Our approach mitigates the effects of spikes while preserving training stability and output quality.

Let $\mathrm{gn}_i$ denote the gradient norm on NPU/GPU $i$ for $i = 1, 2, \dots, N$, where $N$ is the total number of NPUs/GPUs. We define the maximum gradient norm across all NPUs/GPUs as:

$$\mathrm{gn}_{\max} = \max_{i=1}^{N} \mathrm{gn}_i. \tag{21}$$

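To ensure the threshold adapts to the training dynamics, we use the EMA of the maximum gradient norm $\mathrm{ema}_{\mathrm{gn}}$ and its variance-based EMA $\mathrm{ema}_{\mathrm{var}}$, which are updated as follows:

$$\mathrm{ema}_{\mathrm{gn}} = \alpha \cdot \mathrm{ema}_{\mathrm{gn}} + (1 - \alpha) \cdot \mathrm{gn}_{\max}, \tag{22}$$

$$\mathrm{ema}_{\mathrm{var}} = \alpha \cdot \mathrm{ema}_{\mathrm{var}} + (1 - \alpha) \cdot (\mathrm{gn}_{\max} - \mathrm{ema}_{\mathrm{gn}})^2, \tag{23}$$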

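where $\alpha$ is the EMA update rate, which we set to 0.99. We can record whether each gradient norm is abnormal based on the 3-sigma rule, denoted as $\delta_i$:

$$\delta_i = \begin{cases} 0, & \text{if } \mathrm{gn}_i - \mathrm{ema}_{\mathrm{gn}} > 3 \cdot \sqrt{\mathrm{ema}_{\mathrm{var}}}, \\ 1, & \text{otherwise}. \end{cases} \tag{24}$$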

Then, the number of normal gradient norms $M$ can be obtained by summing the indicator functions of all NPUs/GPUs:

$$M = \sum_{i=1}^{N} \delta_i. \tag{25}$$

For each NPU/GPU, we define the final gradient update rule based on the detection result. If an anomaly is detected for NPU/GPU $i$, the gradient for that device is set to zero; otherwise, it is multiplied by $\frac{N}{M}$:

$$g_i^{\text{final}} = \begin{cases} 0, & \text{if } \mathrm{gn}_i - \mathrm{ema}_{\mathrm{gn}} > 3 \cdot \sqrt{\mathrm{ema}_{\mathrm{var}}}, \\ \frac{N}{M} \cdot g_i, & \text{otherwise}. \end{cases} \tag{26}$$

After adjusting the gradients, we apply an all-reduce operation across NPUs/GPUs to synchronize the remaining non-zero gradients. In Fig. 8, we illustrate how the moving average gradient norm addresses abnormal data. Fig. 8 (d) and Fig. 8 (e) show a sudden increase in gradient norm on a specific NPU/GPU near step 75.6k, exceeding the moving average of the maximum gradient norm (seen in Fig. 8 (c)). Consequently, the gradient for this local batch is set to zero (logged in Fig. 8 (b)). We also record the post-discard maximum gradient to confirm successful handling. Finally, the processed maximum gradient norm (logged in Fig. 8 (f)) updates the moving average of the maximum gradient norm and its variance in Fig. 8 (g) and Fig. 8 (h). As shown in Fig. 8 (a), the training loss remains stable without spikes, demonstrating that this approach effectively prevents anomalous batches from affecting the training process without discarding entire iterations.
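The detection and rescaling logic of Eqs. (21) to (26) can be simulated without any distributed framework. The sketch below makes several assumptions that the text does not fix: the 3-sigma bound is taken as three times the square root of the variance EMA, the EMAs are updated from the maximum norm among the kept devices (the "processed" norm), the variance update uses the already updated mean, the initial EMA values are supplied by the caller, and a step in which every device is flagged is skipped entirely.

```python
import math

class AdaptiveGradFilter:
    """Tracks EMA statistics of gradient norms and flags abnormal devices."""

    def __init__(self, init_gn, init_var, alpha=0.99):
        self.alpha = alpha
        self.ema_gn = init_gn    # assumption: initial values chosen by the caller
        self.ema_var = init_var

    def step(self, grad_norms):
        """Return one gradient scale per device (0.0 means discarded)."""
        n = len(grad_norms)
        threshold = 3.0 * math.sqrt(self.ema_var)
        # Eq. (24): delta_i = 0 for abnormal devices, 1 otherwise.
        delta = [0 if gn - self.ema_gn > threshold else 1 for gn in grad_norms]
        m = sum(delta)  # Eq. (25)
        if m == 0:
            # Assumption: skip the step and leave the statistics untouched.
            return [0.0] * n
        scales = [d * n / m for d in delta]  # Eq. (26)
        # Update EMAs with the largest norm among the kept devices.
        gn_max = max(gn for gn, d in zip(grad_norms, delta) if d)
        self.ema_gn = self.alpha * self.ema_gn + (1 - self.alpha) * gn_max
        self.ema_var = (self.alpha * self.ema_var
                        + (1 - self.alpha) * (gn_max - self.ema_gn) ** 2)
        return scales

grad_filter = AdaptiveGradFilter(init_gn=1.0, init_var=0.01)
scales = grad_filter.step([1.0, 1.1, 5.0, 0.9])  # the third device is an outlier
```

In a real training loop each device would multiply its local gradient by its scale before the all-reduce, so that averaging over $N$ devices yields the mean over the $M$ normal ones.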

3.3 Prompt Refiner

The training dataset for the video generation model is annotated by Vision Language Models [chen2024far, wang2024qwen2], providing highly detailed descriptions of scenes and themes, with most annotations consisting of lengthy texts that differ substantially from typical user input. User input is generally less detailed and concise, containing fewer words (e.g., in VBench [vbench], most test texts contain fewer than 30 words, sometimes no more than 5 words). This discrepancy results in a significant gap compared to the textual conditions used in model training, leading to reduced video quality, semantic fidelity, and motion amplitude. To address this gap and enhance the model performance when facing shorter texts, we introduce an LLM to leverage its text expansion and creation capabilities to transform short captions into more elaborate descriptions.

Data preparation. We use GPT-4o to generate paired training texts, using specific prompts to instruct the LLM to supplement detailed actions, scene descriptions, cinematic language, lighting nuances, and environmental atmosphere. These original and LLM-augmented text pairs are then used to train the refiner model. Concretely, the instruct prompt is: rewrite the prompt:“prompt” to contain subject description action, scene description. (Optional: camera language, light and shadow, atmosphere) and conceive some additional actions to make the prompt more dynamic, making sure it’s a fluent sentence. Our data composition for fine-tuning LLM is shown in Tab. 3. Specifically, COCO [lin2014microsoft] consists of manually annotated data, while JourneyDB [sun2024journeydb] contains labels generated by a visual language model (VLM).

Table 3: Overview of utilized datasets for fine-tuning prompt refiner.

| Source | Year | Length | Manual | # Num |
| --- | --- | --- | --- | --- |
| COCO [lin2014microsoft] | 2014 | Short | Yes | 12k |
| DiffusionDB [wang2022diffusiondb] | 2022 | Tags | Yes | 6k |
| JourneyDB [sun2024journeydb] | 2023 | Medium | No | 3k |
| Dense Captions (From Internet) | 2024 | Dense | Yes | 0.5k |

Training Details. We perform LoRA fine-tuning using LLaMA 3.1 8B¹, completing within 1 hour on a single NPU/GPU. Fine-tuning is conducted for just 1 epoch with a batch size of 32 and a LoRA rank of 64. The AdamW optimizer is used with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and a learning rate of 1.5e-4.
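As a rough illustration of the data preparation and training budget, the sketch below formats one training pair with the instruct prompt quoted above and estimates the number of optimizer steps. The record layout (`instruction` and `output` keys) is an assumption, and the dataset sizes are the rounded counts from Tab. 3, so the step count is only approximate.

```python
import math

INSTRUCTION_TEMPLATE = (
    'rewrite the prompt:"{prompt}" to contain subject description action, '
    "scene description. (Optional: camera language, light and shadow, atmosphere) "
    "and conceive some additional actions to make the prompt more dynamic, "
    "making sure it's a fluent sentence."
)

def build_pair(short_prompt, refined_prompt):
    """Build one (hypothetical) fine-tuning record from an original/augmented pair."""
    return {
        "instruction": INSTRUCTION_TEMPLATE.format(prompt=short_prompt),
        "output": refined_prompt,
    }

# Rounded sample counts from Tab. 3.
DATASET_SIZES = {"COCO": 12_000, "DiffusionDB": 6_000, "JourneyDB": 3_000, "Dense Captions": 500}

total_pairs = sum(DATASET_SIZES.values())
steps_per_epoch = math.ceil(total_pairs / 32)  # batch size of 32, 1 epoch

pair = build_pair("a cat on a sofa", "A fluffy cat stretches lazily on a sunlit sofa ...")
```

Roughly 21.5k pairs at a batch size of 32 give on the order of several hundred steps, which is consistent with the reported training time of under one hour.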

4 Data Curation Pipeline

Dataset quality is closely linked to model performance. However, some current open-source datasets, such as WebVid [bain2021frozen], Panda70M [chen2024panda], VIDAL [zhu2023languagebind] and HD-VILA [xue2022hdvila], fall short in data quality. Excessive low-quality data in training disrupts the gradient direction of model learning. In this section, we propose an efficient, structured data-processing pipeline to filter high-quality video clips from raw data. We also present dataset statistics to provide reliable direction for further data enhancement.

4.1 Training Data

Table 4: Data card of Open-Sora Plan v1.3. “*” denotes that the original team employs multiple models, including OFA [wang2022ofa], mPLUG-Owl [ye2023mplug], and ChatGPT [openai2023gpt4] to refine captions. “†” indicates that while we do not release captions generated with QWen2-VL and ShareGPT4Video, the original team has made their generated captions publicly available.

| Domain | Dataset | Source | Captioner | Data Available | Caption Available | # Num |
| --- | --- | --- | --- | --- | --- | --- |
| Image | SAM | SAM | LLaVA | Yes | Yes | 11.1M |
| Image | Anytext | Anytext | InternVL2 | Yes | Yes | 1.8M |
| Image | Human | LAION | InternVL2 | Yes | Yes | 0.1M |
| Image | Internal | - | QWen2-VL | No | No | 5.0M |
| Video | VIDAL | YouTube Shorts | Multi-model∗ | Yes | Yes | 2.8M |
| Video | Panda70M | YouTube | QWen2-VL, ShareGPT4Video | Yes | Yes† | 21.2M |
| Video | StockVideo | Mixkit‡, Pexels⋏, Pixabay⋎ | QWen2-VL, ShareGPT4Video | Yes | Yes | 0.8M |

‡ https://mixkit.co, ⋏ www.pexels.com, ⋎ https://pixabay.com

As shown in Tab. 4, we obtain 11 million image-text pairs from Pixart-Alpha [chen2023pixartalpha], with captions generated by LLaVA [liu2024visual]. Additionally, we use the OCR dataset Anytext-3M [tuo2023anytext], which pairs each image with corresponding OCR characters. We filter Anytext-3M for English data, constituting about half of the entire dataset. Since SAM [kirillov2023segment] data (as used in Pixart-Alpha) includes blurred faces, we selected 160k high-quality images from Laion-5B [schuhmann2022laion] to enhance the quality of person-related content in generation. The selection criteria include high resolution, high aesthetic scores, the absence of watermarks, and the presence of people in the images.

For videos, we download approximately 21M horizontal videos from Panda70M [chen2024panda] using our filtering pipeline. For vertical data, we obtain around 3M vertical videos from VIDAL [zhu2023languagebind], sourced from YouTube Shorts. Additionally, we scrape high-quality videos from CC0-licensed websites, such as Mixkit, Pexels, and Pixabay. These open-source video sites contain no content-related watermarks.

4.2 Data Filtering Strategy

Table 5: Implementation details and discarded data number of different filtering steps.

| Curation Step | Tools | Thresholds | Remaining |
| --- | --- | --- | --- |
| Video Slicing | - | Each video is clipped to 16s | 100% |
| Jump Cut | LPIPS [Zhang_Isola_Efros_Shechtman_Wang_2018] | 32 ≤ frames number ≤ 512 | 97% |
| Motion Calculation | LPIPS [Zhang_Isola_Efros_Shechtman_Wang_2018] | 0.001 ≤ motion score ≤ 0.3 | 89% |
| OCR Cropping | EasyOCR∗ | 0.20 ≤ edge | 89% |
| Aesthetic Filtration | Laion Aesthetic Predictor v2† | 4.75 ≤ aesthetic score | 49% |
| Low-level Quality Filtration | DOVER [wu2023exploring] | 0 ≤ technical score | 44% |
| Motion Double-Checking | LPIPS [Zhang_Isola_Efros_Shechtman_Wang_2018] | 0.001 ≤ motion score ≤ 0.3 | 42% |

∗ https://github.com/JaidedAI/EasyOCR

† https://github.com/christophschuhmann/improved-aesthetic-predictor

Video Slicing. Excessively long videos are not conducive to input processing, so we use the stream-copy method in ffmpeg² to split videos into 16-second clips.

Jump Cut and Motion Calculation. We calculate the Learned Perceptual Image Patch Similarity (LPIPS) [Zhang_Isola_Efros_Shechtman_Wang_2018] between consecutive frames. Outliers are identified as cut points, while the mean value represents motion. Specifically, we utilize the decord³ library to efficiently read video frames with skipping. After reading the video, we calculate the LPIPS values to obtain a set of semantic similarities between frames, denoted as $l \in \mathcal{L}$, and compute its mean $\mu$ and standard deviation $\sigma$. Then, we calculate the z-scores of $\mathcal{L}$, $\mathcal{Z} = \{ z = \frac{l - \mu}{\sigma} \mid l \in \mathcal{L} \}$, to obtain the set of potential anomaly indices $\mathcal{P} = \{ i \mid z_i > z_{threshold},\ z_i \in \mathcal{Z} \}$. We further filter the anomalies by $\mathcal{P}_{final} = \{ i \mid \mathcal{L}[i] > l_{threshold} \text{ or } (z_i > z_{threshold2} \text{ and } \mathcal{L}[i] > l_{threshold2}),\ i \in \mathcal{P} \}$ to obtain the final set of anomaly indices. Based on our experiments, we set the parameters as $z_{threshold} = 2.0$, $l_{threshold} = 0.35$, $z_{threshold2} = 3.2$, and $l_{threshold2} = 0.2$. To validate the efficacy of our method, we conduct a manual assessment of 2,000 videos. The result demonstrates that the accuracy meets our predetermined criteria.
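The cut-detection rule above can be sketched in plain Python, given a list of LPIPS values between consecutive (sampled) frames. Computing the LPIPS values themselves is not shown, and the use of the population standard deviation and the handling of a constant sequence are assumptions.

```python
import statistics

def detect_cuts(lpips, z_thr=2.0, l_thr=0.35, z_thr2=3.2, l_thr2=0.2):
    """Return (cut indices, motion score) for a sequence of LPIPS values."""
    mu = statistics.fmean(lpips)        # the mean doubles as the motion score
    sigma = statistics.pstdev(lpips)    # assumption: population standard deviation
    if sigma == 0:
        return [], mu                   # constant sequence: no outliers possible
    z = [(l - mu) / sigma for l in lpips]
    # Candidate anomalies: z-score above the first threshold.
    candidates = [i for i, zi in enumerate(z) if zi > z_thr]
    # Keep a candidate if its LPIPS is large in absolute terms, or if it is
    # both a strong outlier and moderately large.
    cuts = [
        i for i in candidates
        if lpips[i] > l_thr or (z[i] > z_thr2 and lpips[i] > l_thr2)
    ]
    return cuts, mu

cuts, motion = detect_cuts([0.05] * 20 + [0.6] + [0.05] * 20)
keep_by_motion = 0.001 <= motion <= 0.3
```

The second condition catches cuts in low-motion clips, where a jump of 0.25 is far from the mean even though it stays below the absolute threshold of 0.35.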

OCR Cropping. We employ EasyOCR to detect subtitles in videos by sampling one frame per second. Based on our estimates for common video platforms, subtitles typically appear in the edge regions, with manual verification showing an average occurrence in 18% of these areas. Therefore, we set the maximum cropping range to 20% on each side of the video spatial size $(H, W)$, i.e., in extreme cases the cropped video has size $(0.6H, 0.6W)$ and 36% of the original area. We then crop subtitles appearing within this range, leaving any text in the central area unprocessed. We consider text appearing in certain contexts, such as advertisements, speeches, or library settings, to be reasonable. In summary, we do not assume that all text in a video should be filtered out, since certain words carry meaning in specific contexts, and we leave further judgments to aesthetic considerations. Note that the OCR step only crops text areas without discarding videos.

Aesthetic Filtration. We use the Laion aesthetic predictor to assess the aesthetic score of a video. The aesthetic predictor effectively filters out videos that are blurry, low-resolution, overly exposed, excessively dark, or contain prominent watermarks or logos. We set a threshold of 4.75 to filter videos, as this value effectively removes extensive text and retains high aesthetic quality. We uniformly sample five frames from each video and average their scores to obtain the final aesthetic score. This filtering process eliminates approximately 40% of videos that do not meet human aesthetic standards.
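The per-video decision can be sketched as follows. The aesthetic predictor itself is not modeled: the sketch takes per-frame scores as input, and the choice of evenly spaced indices that include the first and last frame is an assumption about what "uniformly sample" means here.

```python
def uniform_frame_indices(num_frames, num_samples=5):
    """Evenly spaced frame indices, including the first and last frame."""
    if num_frames <= num_samples:
        return list(range(num_frames))
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]

def video_aesthetic_score(frame_scores):
    """Average the per-frame aesthetic scores into one video-level score."""
    return sum(frame_scores) / len(frame_scores)

def keep_video(frame_scores, threshold=4.75):
    """Keep the video when its averaged score reaches the threshold."""
    return video_aesthetic_score(frame_scores) >= threshold

indices = uniform_frame_indices(101)
kept = keep_video([5.0, 5.0, 4.5, 4.5, 5.0])
dropped = not keep_video([4.0, 5.0, 4.5, 4.5, 5.0])
```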

Low-level Quality Filtration. However, even when some data have high resolutions, their visual effects can still appear very blurry or exhibit a mosaic-like quality, which is attributed to two factors: (i) Low bitrate or DPI of the video. (ii) Usage of motion blur techniques in 24 FPS videos, which simulate dynamic effects by blurring the image between frames, resulting in smoother visual motion. For these videos with absolutely low quality, aesthetic filtering struggles to eliminate them since frames are resized to a resolution of 224. We aim to utilize a metric independent of the visual content that evaluates absolute video quality, focusing on issues including compression artifacts, low bitrate, and temporal jitter. Finally, we adopt the technical prediction score from DOVER [wu2023exploring], selecting videos with a technical score > 0, which filters out 5% of the videos.

Motion Double-Checking. In our post-check, we find that the changes in subtitles may lead to inaccuracies in motion values because the OCR cropping step occurs after detecting motion values. Therefore, we recheck the motion values and filter out videos whose average frame similarity satisfies $\bar{\mathcal{L}} < 0.001$ or $\bar{\mathcal{L}} > 0.3$, which account for 2% of the videos.

4.3 Data Annotation

Dense captioning provides additional semantic information for each sample, enabling the model to learn specific correspondences between text and visual features. Supervised by dense caption during diffusion training, the model gradually builds a conceptual understanding of various objects and scenes. However, the cost of manual annotation for dense captions is prohibitive, so large image-language models [wang2023cogvlm, yao2024minicpm, chen2024far, chen2023sharegpt4v, lin2024moe, liu2024improved, wang2024qwen2] and large video-language models [lin2023video, chen2024sharegpt4video, wang2024qwen2, xu2024pllava, liu2024ppllava, wang2024tarsier, jin2024chat] are typically used for annotation. This capability allows the model to express complex concepts in dense captions more accurately during image and video generations.

For images, the SAM dataset has available captions generated by LLaVA. Although Anytext contains some OCR-recognized characters, these are insufficient to describe the entire image. Therefore, we use InternVL2 [chen2024far] and QWen2-VL-7B [wang2024qwen2] to generate captions for the images. The descriptions are as detailed and diverse as possible. The annotation prompt is: Combine this rough caption: “{}”, analyze the image in a comprehensive and detailed manner. “{}” can be recognized in the image.

For videos, in early versions such as Open-Sora Plan v1.1, we use ShareGPT4Video-7B [chen2024sharegpt4video] to annotate a portion of the videos. Another portion is annotated with QWen2-VL-7B [wang2024qwen2], with the input prompt: Please describe the content of this video in as much detail as possible, including the objects, scenery, animals, characters, and camera movements within the video. Please start the description with the video content directly. Please describe the content of the video and the changes that occur, in chronological order.

However, 7B caption models often generate prefixes like “This image” or “The video”. We search all such irrelevant strings and remove them.
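A minimal version of this cleanup step is sketched below. The exact list of removed strings is not given in the text, so the pattern (a leading "This/The image/video" plus an optional verb) and the re-capitalization of the remaining caption are assumptions.

```python
import re

# Hypothetical pattern: leading "This image" / "The video", optionally followed
# by a common verb such as "shows" or "depicts".
PREFIX_PATTERN = re.compile(
    r"^\s*(?:this|the)\s+(?:image|video)\s+"
    r"(?:(?:shows|depicts|features|captures|presents)\s+)?",
    flags=re.IGNORECASE,
)

def strip_caption_prefix(caption):
    """Remove a boilerplate prefix and capitalize the remaining caption."""
    cleaned = PREFIX_PATTERN.sub("", caption, count=1)
    return cleaned[:1].upper() + cleaned[1:]

example = strip_caption_prefix("The video shows a dog running on a beach.")
```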

Figure 9: (a) Distribution statistics of image datasets. The first row is the aesthetic scores distribution of the data, and the second row is the resolution distribution of the data. (b) Distribution statistics of video datasets. The first row is the duration distribution of the data, the second row is the aesthetic score distribution of the data, and the third row is the resolution distribution of the data.

4.4 Data Statistics

Image Data. The filtered image data primarily includes Anytext, Human-images, and SAM. We have plotted the top-10 most frequent resolutions, along with histograms depicting the distribution of aesthetic scores, as shown in Fig. 9 (a). The plots indicate that the Anytext dataset has a unified resolution 512 × 512 . In contrast, Human-images and SAM datasets exhibit more diverse scores and resolutions. Human-images dataset shows a range of scores and multiple resolutions, suggesting varied content, while SAM heavily favors high resolutions 2250 × 1500 . Overall, Anytext is consistent, while Human-images and SAM offer greater diversity in both aesthetic scores and image resolutions.

Video Data. The filtered video data primarily includes Panda70M, VIDAL-10M, and several stock video websites (e.g., Pixabay, Pexels, Mixkit). We have plotted the top 10 most frequent resolutions, along with histograms depicting the distribution of video duration, aesthetic scores, and resolution across the three datasets, as shown in Fig. 9 (b). From the distribution plots, it is evident that both Panda70M and VIDAL-10M contain shorter average video durations and relatively lower aesthetic scores. In contrast, videos from stock video websites tend to have longer durations and higher aesthetic quality. Regarding resolution, the majority of videos across all three datasets are 1280 × 720, with VIDAL-10M being a vertical video dataset (height > width), while the other two datasets are predominantly landscape (width > height).

5 Results

5.1 Wavelet-Flow VAE

Tab. 6 and Fig. LABEL:fig:reconstruction present both quantitative and qualitative comparisons with several open-source VAEs, including Allegro [zhou2024allegro], OD-VAE [chen2024od], and CogVideoX [yang2024cogvideox]. The experiments utilize the Panda70M [Chen_2024_CVPR] and WebVid-10M [Bain_Nagrani_Varol_Zisserman_2021] datasets. To comprehensively evaluate reconstruction performance, we adopt the Peak Signal-to-Noise Ratio (PSNR) [Hore_Ziou_2010], Learned Perceptual Image Patch Similarity (LPIPS) [Zhang_Isola_Efros_Shechtman_Wang_2018], and Structural Similarity Index Measure (SSIM) [wang2004image] as the primary evaluation metrics. Furthermore, the reconstruction Fréchet Video Distance (rFVD) [Unterthiner_Steenkiste_Kurach_Marinier_Michalski_Gelly_2019] is employed to assess visual quality and temporal coherence.

As shown in Tab. 6, WF-VAE-S achieves a throughput of 11.11 videos per second when encoding 33-frame videos at 512 × 512 resolution. This throughput surpasses CV-VAE and OD-VAE by approximately 6 × and 4 × , respectively. The memory cost reduces by nearly 5 × and 7 × compared to these baselines while achieving superior reconstruction quality. For the larger WF-VAE-L model, the encoding throughput exceeds Allegro by 7.8 × , with approximately 8 × lower memory usage, while maintaining better evaluation metrics. These results demonstrate that the WF-VAE maintains state-of-the-art reconstruction performance while substantially reducing computational costs.
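The ratios quoted above follow directly from the throughput and memory columns of Tab. 6, as this short check shows. The dictionary simply transcribes those two columns; the helper names are hypothetical.

```python
# (throughput in videos/s, memory in GB) transcribed from Tab. 6.
TABLE6 = {
    "CV-VAE": (1.85, 25.00),
    "OD-VAE": (2.63, 31.19),
    "Allegro": (0.71, 54.35),
    "WF-VAE-S": (11.11, 4.70),
    "WF-VAE-L": (5.55, 7.00),
}

def speedup(model, baseline):
    """How many times faster `model` encodes than `baseline`."""
    return TABLE6[model][0] / TABLE6[baseline][0]

def memory_saving(model, baseline):
    """How many times less memory `model` uses than `baseline`."""
    return TABLE6[baseline][1] / TABLE6[model][1]

for base in ("CV-VAE", "OD-VAE"):
    print(base, round(speedup("WF-VAE-S", base), 1), round(memory_saving("WF-VAE-S", base), 1))
print("Allegro", round(speedup("WF-VAE-L", "Allegro"), 1), round(memory_saving("WF-VAE-L", "Allegro"), 1))
```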

We assess the impact of lossy block-wise inference on reconstruction metrics using contemporary open-source VAE implementations [yang2024cogvideox, chen2024od], as summarized in Tab. 7. Specifically, we measure reconstruction performance in terms of PSNR and LPIPS on the Panda70M dataset under both block-wise and direct inference conditions. The overlap-fusion-based tiling inference of OD-VAE results in substantial performance degradation. In contrast, CogVideoX exhibits only minor degradation due to its temporal block-wise inference with caching. Notably, our proposed Causal Cache mechanism delivers reconstruction results that are numerically identical to those of direct inference, thereby confirming its lossless reconstruction capability.

Table 6: Quantitative comparison with state-of-the-art VAEs on WebVid-10M dataset. Reconstruction metrics are evaluated on 33-frame videos at a resolution of 256 × 256. “T” and “Mem.” denote encoding throughput and Memory cost (GB), assessed on 33-frame videos at a resolution of 512 × 512. The highest result is highlighted in bold, and the second highest result is underlined.

| Channel | Model | T ↑ | Mem. ↓ | PSNR ↑ | LPIPS ↓ | rFVD ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| 4 | CV-VAE | 1.85 | 25.00 | 30.76 | 0.0803 | 369.23 |
| 4 | OD-VAE | 2.63 | 31.19 | 30.69 | 0.0553 | 255.92 |
| 4 | Allegro | 0.71 | 54.35 | 32.18 | 0.0524 | 209.68 |
| 4 | WF-VAE-S (Ours) | 11.11 | 4.70 | 31.39 | 0.0517 | 188.04 |
| 4 | WF-VAE-L (Ours) | 5.55 | 7.00 | 32.32 | 0.0513 | 186.00 |
| 16 | CogVideoX | 1.02 | 35.01 | 35.76 | 0.0277 | 59.83 |
| 16 | WF-VAE-L (Ours) | 5.55 | 7.00 | 35.79 | 0.0230 | 54.36 |

Table 7: Quantitative analysis of visual quality degradation induced by block-wise inference on Panda70M. BWI denotes Block-Wise Inference and experiments are conducted on 33 frames with 256 × 256 resolution. Values highlighted in red signify degradation in comparison to direct inference, whereas values highlighted in green indicate preservation of the quality.

| Channel | Method | BWI | PSNR ↑ | LPIPS ↓ |
| --- | --- | --- | --- | --- |
| 4 | OD-VAE | ✗ | 30.31 | 0.0439 |
| 4 | OD-VAE | ✓ | 28.51 (-1.80) | 0.0552 (+0.011) |
| 4 | WF-VAE-L (Ours) | ✗ | 32.10 | 0.0411 |
| 4 | WF-VAE-L (Ours) | ✓ | 32.10 (-0.00) | 0.0411 (-0.000) |
| 16 | CogVideoX | ✗ | 35.79 | 0.0198 |
| 16 | CogVideoX | ✓ | 35.41 (-0.38) | 0.0218 (+0.002) |
| 16 | WF-VAE-L (Ours) | ✗ | 35.87 | 0.0175 |
| 16 | WF-VAE-L (Ours) | ✓ | 35.87 (-0.00) | 0.0175 (-0.000) |

Table 8: Quantitative comparison of Open-Sora Plan and other state-of-the-art methods. “*” denotes that we use our prompt refiner to get results.

| Model | Size | Aesthetic Quality | Action | Object Class | Spatial | Scene | Multiple Objects | CH Score | GPT4o MTScore |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenSora v1.2 | 1.2B | 56.18 | 85.8 | 83.37 | 67.51 | 42.47 | 58.41 | 51.87 | 2.50 |
| CogVideoX-2B | 1.7B | 58.78 | 89.0 | 78.00 | 53.91 | 38.59 | 48.48 | 38.60 | 3.09 |
| CogVideoX-5B | 5.6B | 56.46 | 77.2 | 76.85 | 45.89 | 41.44 | 46.43 | 48.45 | 3.36 |
| Mochi-1 | 10.0B | 56.94 | 94.6 | 86.51 | 69.24 | 36.99 | 50.47 | 28.07 | 3.76 |
| OpenSoraPlan v1.3 | 2.7B | 59.00 | 81.8 | 70.97 | 44.46 | 28.56 | 35.87 | 71.00 | 2.64 |
| OpenSoraPlan v1.3∗ | 2.7B | 60.70 | 86.4 | 84.72 | 49.63 | 52.92 | 44.57 | 68.39 | 2.95 |

5.2 Text-to-Video

We evaluate the quality of our video generation model using VBench [vbench] and ChronoMagic-Bench-150 [chronomagic_bench]. VBench, a commonly used metric in video generation, deconstructs “video generation quality” into several clearly defined dimensions, allowing for a fine-grained, objective assessment. However, many metrics are overly detailed and yield uniformly high scores across models, offering limited reference value. Consequently, we select the Object Class, Multiple Objects, and Human Action dimensions to evaluate the semantic fidelity of generated objects and human actions. Aesthetic Quality is used to assess spatial generation effects, while Spatial Relationship reflects the model’s understanding of spatial relationships. For motion amplitude, we adopt ChronoMagic-Bench since motion evaluation metrics in VBench are considered inadequate.

Tab. 8 compares the performance of the Open-Sora Plan with other state-of-the-art models. Results indicate that the Open-Sora Plan performs exceptionally well in video generation quality, and it has significant advantages over other models in terms of aesthetic quality, smoothness, and scene restoration fidelity. In addition, our model can automatically optimize the text prompts to further improve the generation quality.

5.3Condition Controllers

Image-to-Video. The video generation capability of image-to-video depends significantly on the performance of the base model and the quality of the initial frame, resulting in challenges in establishing fully objective evaluation metrics. To illustrate the generation ability of Open-Sora Plan, we select several showcases, as shown in Fig. LABEL:fig:_showcase_i2v, demonstrating that our model exhibits excellent image-to-video generation capabilities and realistic motion dynamics. Furthermore, we compare the image-to-video results of several state-of-the-art methods in Fig. LABEL:fig:_compre_i2v, highlighting that Open-Sora Plan strikes an exceptional balance between the control information of the initial frame and the text. Our method maintains semantic consistency while ensuring high visual quality, demonstrating superior expressiveness compared to other models.

Figure 10:Our structure controller can generate high-quality videos conditioned by specified structural signals corresponding to arbitrary frames.

Structure-to-Video. As shown in Fig. 10, our structure condition controller enables the Open-Sora Plan text-to-image model to generate high-quality videos in which arbitrary frames (the first frame, a few frames, all frames, etc.) can be accurately controlled by the given structural signals (canny, depth, sketch, etc.).

5.4Prompt Refiner

Figure: Ablation results for leveraging the prompt refiner in VBench. Evaluated videos are generated in 480p.

Open-Sora Plan leverages a substantial proportion of synthetic labels during training, so it performs better with dense captions than with shorter prompts. However, evaluation prompts and user inputs are often brief, which limits the ability to accurately assess the model’s true performance. Following DALL-E 3 [Dalle3], we report evaluation results where our prompt refiner is employed to rewrite input prompts.

During the evaluation, we observe notable improvements in most VBench [vbench] metrics when using the prompt refiner, particularly in action accuracy and object description. Fig. 5.4 provides a radar chart that visually highlights the effectiveness of the prompt refiner. Specifically, the performance in human action generation and spatial relationship depiction improves by more than 5%. The semantic adherence for single-object and multi-object generation increases by 15% and 10%, respectively. Additionally, the score for scenery generation increases by 25%. Furthermore, our prompt refiner can translate multilingual prompts into English, allowing the diffusion model to leverage training data and text encoders in English while supporting various languages at inference.

6Limitation and Future Work

6.1Wavelet-Flow VAE

Our decoder architecture is modeled after the design proposed by [rombach2022high], resulting in a greater number of parameters in the decoder compared to the encoder. While the computational cost remains manageable, we consider these additional parameters to be redundant. Consequently, in future work, we plan to streamline the model to fully exploit the advantages of our architecture.

6.2Transformer Denoiser

The current 2B model in version 1.3.0 shows performance saturation during the later stages of training. In addition, our model performs poorly in understanding physical laws (e.g., a cup overflowing with milk, a car moving forward, or a person walking). We have three hypotheses for this:

•

Joint training of images and videos. Models such as Open-Sora v1.2 [opensora], EasyAnimate v4 [xu2024easyanimate], and Vchitect-2.0 can easily generate videos of high visual quality, possibly because they directly inherit image model weights (Pixart-Sigma [chen2024pixart], HunyuanDiT [li2024hunyuan], SD3 [esser2024scaling]). They then train on a small amount of video data so that the model learns how to flow along the temporal dimension on top of 2D images. In contrast, we train on images from scratch with only 10M-level data, which is far from sufficient. The recent Allegro work [zhou2024allegro] fine-tuned a better text-to-image model starting from the T2I weights of Open-Sora Plan v1.2 and achieved improved text-to-video results. We have two hypotheses regarding the training strategy: (i) start joint training from scratch, with images significantly outnumbering videos; (ii) first train a high-quality image model and then switch to joint training, with a higher proportion of videos at that stage. Considering the learning path and training costs, the second approach may offer more decoupling, while the first aligns better with scaling laws.

•

The model still needs to scale. Comparing CogVideoX-2B [yang2024cogvideox] with its 5B variant, we observe that the 5B model understands more physical laws than the 2B model. We speculate that instead of spending excessive effort on designs for smaller models, it may be more effective to leverage scaling laws to solve these issues. In the next version, we will scale up the model to explore the boundaries of video generation. We currently have two plans: (i) continue using the DeepSpeed [rasley2020deepspeed]/FSDP [zhao2023pytorch] approach, sharding the EMA and text encoder across ranks with ZeRO-3 [rasley2020deepspeed], which is sufficient for training 10-15B models; (ii) adopt MindSpeed/Megatron-LM [shoeybi2019megatron] for various parallel strategies, enabling us to scale the model up to 30B.

•

Supervised loss in training. Flow Matching [lipman2022flow] avoids the stability issues in Denoising Diffusion Probabilistic Models [ho2020denoising] (DDPM) when the timestep approaches 0, addressing the zero-terminal signal-to-noise ratio problem [lin2024common]. Recent works [opensora, polyak2024movie, esser2024scaling] also show that the validation loss in Flow Matching indicates whether the model is converging in the right direction, which is crucial for assessing model training progress. Whether flow-based models are more suitable than v-prediction models requires further ablation studies.
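The difference between the two training targets mentioned in the last hypothesis can be sketched numerically. The snippet below is a minimal illustration, assuming the linear-interpolation (rectified-flow) form of Flow Matching and the standard variance-preserving form of v-prediction; the function names and toy tensor shapes are hypothetical and are not taken from the Open-Sora Plan codebase.

```python
import numpy as np

def _per_sample(v, like):
    # Reshape a per-sample scalar array so it broadcasts over (batch, ...) tensors.
    return np.asarray(v, dtype=like.dtype).reshape(-1, *([1] * (like.ndim - 1)))

def flow_matching_pair(x0, eps, t):
    """Assumed rectified-flow form: x_t = (1 - t) * x0 + t * eps.
    Returns the noisy sample and the velocity target eps - x0."""
    t = _per_sample(t, x0)
    x_t = (1.0 - t) * x0 + t * eps
    return x_t, eps - x0

def v_prediction_pair(x0, eps, alpha, sigma):
    """Variance-preserving diffusion: x_t = alpha * x0 + sigma * eps,
    with v-target alpha * eps - sigma * x0 (alpha**2 + sigma**2 == 1)."""
    alpha, sigma = _per_sample(alpha, x0), _per_sample(sigma, x0)
    x_t = alpha * x0 + sigma * eps
    return x_t, alpha * eps - sigma * x0

rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 4, 8, 8))   # toy latents: (batch, channels, H, W)
eps = rng.standard_normal(x0.shape)      # Gaussian noise of the same shape

# Sample 0 sits at the data end (t = 0), sample 1 at the noise end (t = 1).
x_t, fm_target = flow_matching_pair(x0, eps, t=np.array([0.0, 1.0]))
x_v, v_target = v_prediction_pair(x0, eps,
                                  alpha=np.array([1.0, 0.0]),
                                  sigma=np.array([0.0, 1.0]))
```

Under this assumed interpolation, the sample at t = 1 is exactly the noise tensor, so the terminal signal-to-noise ratio is zero by construction, and the regression target eps - x0 does not depend on t. The v-prediction target, by contrast, changes with the noise schedule through alpha and sigma.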

In addition to expanding the model and data scale, we will also explore other efficient algorithm implementations and improved evaluation metrics:

•

Exploring more efficient architectures. Although Skiparse Attention significantly reduces FLOPs during computation, these advantages are only noticeable with longer sequence lengths (e.g., resolutions above 480P). Since most pre-training is conducted at a lower resolution (e.g., around 320 pixels), the Skiparse Attention operation has not achieved the desired acceleration ratio in this phase. In the future, we will explore more efficient training strategies to address this issue.

•

Introducing more parallelization strategies. In Movie Gen [polyak2024movie], the role of various parallelization strategies in accelerating training for video generation models is highlighted. However, Open-Sora Plan v1.3.0 currently only employs data parallelism (DP). In the future, we plan to explore additional parallelization strategies to enhance training efficiency. Additionally, in Skiparse Attention, each token only needs to attend to the same subset of at most $\frac{2}{k} - \frac{1}{k^{2}}$ of all tokens throughout, without requiring access to other tokens. This operation naturally suits a sequence parallelization strategy. However, the efficient implementation of this sequence parallelization in code remains a topic for further exploration.

•

Establishing reliable evaluation metrics. Although works like VBench [vbench] and ChronoMagic-Bench [chronomagic_bench] have proposed metrics to automate the evaluation of video model outputs, these metrics still cannot fully replace human review [polyak2024movie]. Human evaluation is labor-intensive and incurs significant costs, making it less feasible at scale. Therefore, developing more accurate and reliable automated metrics remains a key area for future research, and we will prioritize this in our work.
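The claim in the parallelization item above, that each token only ever attends to a fixed subset covering at most 2/k − 1/k² of the sequence, can be checked with a small counting sketch. It assumes, as a simplified reading of Skiparse Attention, that Single Skip lets token i attend to tokens with the same index modulo k, and that Group Skip lets it attend to tokens whose k-sized group index matches modulo k, on a flattened sequence whose length is a multiple of k². The function names are hypothetical.

```python
def skiparse_neighbors(i, n, k):
    """Tokens that token i may attend to under the assumed Single Skip
    and Group Skip patterns (union over both block types)."""
    single = {j for j in range(n) if j % k == i % k}                  # Single Skip
    group = {j for j in range(n) if (j // k) % k == (i // k) % k}     # Group Skip
    return single | group

def coverage_fraction(n, k):
    """Largest fraction of the sequence that any single token attends to."""
    return max(len(skiparse_neighbors(i, n, k)) for i in range(n)) / n

n, k = 64, 4                      # toy sequence length, sparse ratio
frac = coverage_fraction(n, k)    # measured coverage
bound = 2 / k - 1 / k**2          # closed-form value from the text
```

Each pattern alone covers 1/k of the sequence and the two overlap on 1/k² of it, which gives the 2/k − 1/k² total. Because this neighbor set is the same in every block, a sequence-parallel layout could in principle place each token together with its fixed neighbors on one device, which is the property the item above points to.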

6.3Data

Despite ongoing improvements to our training data, the current dataset still faces several significant limitations in terms of data diversity, temporal modeling, video quality, and cross-modal information. We discuss these limitations and outline the corresponding directions for future work:

•

Lack of Data Diversity and Complexity. The current dataset predominantly covers specific domains such as simple actions, human faces, and a narrow range of scene types. We randomly sampled 2,000 videos from Panda70M and verified them manually, finding that less than 1% featured cars in motion and fewer than 10 videos showed people walking. Approximately 80% of the videos consist of half-body conversations with multiple people in front of the camera. We therefore speculate that the narrow data domain of Panda70M restricts the model’s ability to generate many scenarios. As a result, the model lacks the ability to generate complex, dynamic scenes involving realistic human movement, object deformations, and intricate natural environments, which hinders its capacity to produce diverse and complex video content. Future work will focus on expanding the dataset to encompass a broader spectrum of dynamic and realistic environments, including more complex human interactions and dynamic physical effects. This expansion aims to improve the model’s generalization ability and facilitate the generation of high-quality, varied dynamic videos.

•

Lack of Camera Movement, Video Style, and Motion Speed Annotations. The current dataset lacks annotations for key dynamic aspects of video content, such as camera movement, video style, and motion speed. These annotations are essential for capturing the varied visual characteristics and movement dynamics within videos. Without them, the dataset may not fully support tasks that require detailed understanding of these elements, limiting the model’s ability to handle diverse video content. In future work, we will include these annotations to enhance the dataset’s versatility and improve the model’s ability to generate more contextually rich video content.

•

Limitations in Video Resolution and Quality. Although the dataset includes videos at common resolutions (e.g., 720P), these resolutions are insufficient for high-quality video generation tasks, such as generating detailed virtual characters or complex, high-fidelity scenes. The resolution and quality of the current dataset become limiting factors when generating fine-grained details or realistic dynamic environments. To address this limitation, future work should aim to incorporate high-resolution videos (e.g., 1080P, 2K), which will enable the generation of higher-quality videos with enhanced visual detail and realism.

•

Lack of Cross-Modal Information. The dataset predominantly focuses on video imagery and lacks complementary modalities such as audio or other forms of multi-modal data. This absence of cross-modal information limits the flexibility and applicability of generative models, particularly in tasks that involve speech, emotions, or contextual understanding. Future research should focus on integrating multi-modal data into the dataset. This will enhance the model’s ability to generate richer, more contextually nuanced content, thereby improving the overall performance and versatility of the generative system.

7Conclusion

We present Open-Sora Plan, our open-source project for high-quality, long-duration video generation. On the framework side, we decompose the entire video generation model into a Wavelet-Flow Variational Autoencoder, a Joint Image-Video Skiparse Denoiser, and various condition controllers. On the strategy side, we carefully design a min-max token strategy for efficient training, an adaptive gradient clipping strategy for suppressing outlier gradients, and a prompt refiner for obtaining more satisfying results. Furthermore, we propose a multi-dimensional data curation pipeline for automatically obtaining high-quality data. While Open-Sora Plan has reached a remarkable milestone, we will continue working to advance high-quality video generation research and the open-source community.

Contributors and Acknowledgements

Contributors

Bin Lin1, Yunyang Ge1, Xinhua Cheng1, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, Tanghui Jia, Junwu Zhang, Zhenyu Tang, Yatian Pang, Bin She, Cen Yan, Zhiheng Hu, Xiaoyi Dong, Lin Chen, Zhang Pan, Xing Zhou, Shaoling Dong, Yonghong Tian, Li Yuan

Project Lead

Li Yuan

Acknowledgements

We sincerely thank Zesen Cheng, Chengshu Zhao, Zongying Lin, Yihang Liu, Ziang Wu, Peng Jin, and Hao Li for their valuable support of the Open-Sora Plan project.
