Tags: Text-to-Video · Diffusers · Safetensors · English · efficient · mobile video generation · dit · pyramidal diffusion

Citation

@article{karnewar2025neodragon,
  author  = {Animesh Karnewar and Denis Korzhenkov and Ioannis Lelekas and Noor Fathima and Adil Karjauv and Hanwen Xiong and Vancheeswaran Vaidyanathan and Will Zeng and Rafael Esteves and Tushar Singhal and Fatih Porikli and Mohsen Ghafoorian and Amirhossein Habibian},
  title   = {Neodragon: Mobile Video Generation using Diffusion Transformer},
  journal = {arXiv preprint arXiv:2511.06055},
  year    = {2025},
  note    = {Published in the Proceedings of ICLR 2026. OpenReview: \url{https://openreview.net/forum?id=XBzIhhwv8d}; arXiv technical-report: \url{https://arxiv.org/abs/2511.06055}}
}

We introduce Neodragon, a text-to-video system capable of generating 2 s videos (49 frames @ 24 fps) at a resolution of 640×1024 directly on a Qualcomm Hexagon NPU in a record ~6.7 s (roughly 7 frames generated per second). Unlike existing transformer-based offline text-to-video generation models, Neodragon is the first to be specifically optimized for mobile hardware to achieve efficient, low-cost, and high-fidelity video synthesis. Our main contributions are:

  • Replacing the original large 4.762B T5XXL Text-Encoder with a much smaller 0.2B DT5 (DistilT5) with minimal quality loss, enabling the entire model to run without CPU offloading. This is made possible by a novel Text-Encoder Distillation procedure that uses only generative text-prompt data and requires no image or video data.
  • Proposing an Asymmetric Decoder Distillation approach which allows us to replace the native codec-latent-VAE decoder with a more efficient one, without disturbing the generative latent-space of the video generation pipeline.
  • Pruning of MMDiT blocks within the denoiser backbone based on their relative importance, with recovery of original performance through a two-stage distillation process.
  • Reducing the NFE (number of function evaluations) requirement of the denoiser by performing step distillation using a technique adapted from DMD for pyramidal flow-matching, thereby significantly accelerating video generation (a simplified sketch follows this list).
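To make the last point concrete, below is a minimal sketch of a DMD-style distribution-matching update under a flow-matching interpolant. The module names (student, teacher_score, fake_score) and the weighting are our illustrative assumptions, not the actual Neodragon training code.

import torch

def dmd_generator_loss(student, teacher_score, fake_score, noise, prompt_emb, t):
    # 1) The few-step student maps pure noise to a clean sample.
    x_gen = student(noise, prompt_emb)

    # 2) Re-noise the sample to an intermediate time t (a tensor in [0, 1],
    #    broadcastable to x_gen) along the flow-matching interpolant
    #    x_t = (1 - t) * x_0 + t * eps.
    eps = torch.randn_like(x_gen)
    x_t = (1.0 - t) * x_gen + t * eps

    # 3) Distribution-matching direction: the gap between the score of the
    #    student's output distribution ("fake") and the teacher's ("real").
    with torch.no_grad():
        grad = fake_score(x_t, t, prompt_emb) - teacher_score(x_t, t, prompt_emb)

    # 4) Surrogate loss whose gradient pushes x_gen opposite to grad, i.e.
    #    toward the teacher's distribution; the fake score network is trained
    #    separately on fresh student samples with a standard denoising loss.
    return (x_gen * grad).mean()

In the pyramidal setting, such an update would be applied per resolution stage of the pyramid; see the technical report for the actual formulation.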

When paired with an optimized SSD1B first-frame image generator and QuickSRNet for 2× super-resolution, the end-to-end Neodragon system is highly efficient in parameters (4.945B full model), memory (3.5 GB peak RAM usage), and runtime (6.7 s end-to-end latency), making it a mobile-friendly model while achieving a VBench total score of 81.61 and yielding high-fidelity generated videos.
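For orientation, the stages of this end-to-end system can be summarized as in the sketch below; every object name and signature is an illustrative placeholder rather than the repository's actual interface.

def generate_video(prompt, distil_t5, ssd1b, neodragon_dit, tiny_aehv, quicksrnet):
    # Stage 1: encode the prompt with the distilled 0.2B DT5 text encoder.
    text_emb = distil_t5.encode(prompt)
    # Stage 2: synthesize the first frame with the optimized SSD1B image model.
    first_frame = ssd1b.generate(prompt)
    # Stage 3: denoise a 49-frame latent video with the pruned MMDiT backbone
    #          (few NFEs thanks to the DMD-based step distillation).
    latents = neodragon_dit.sample(text_emb, first_frame)
    # Stage 4: decode the latents to 49 RGB frames at 320x512 with TinyAEHV.
    frames = tiny_aehv.decode(latents)
    # Stage 5: 2x super-resolution with QuickSRNet -> 640x1024, 2 s @ 24 fps.
    return quicksrnet.upscale(frames)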

By enabling low-cost, private, and on-device text-to-video synthesis, Neodragon democratizes AI-based video content creation, empowering creators to generate high-quality videos without reliance on cloud services.

Inference code is available at: https://github.com/qualcomm-ai-research/neodragon

How to Run Inference

Please refer to: https://github.com/qualcomm-ai-research/neodragon
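For rough orientation only, usage could look like the hypothetical sketch below; the import path, class name, and arguments are our assumptions, and the actual entry points in the repository may differ.

# Hypothetical usage sketch; the real entry points live in the
# qualcomm-ai-research/neodragon repository and may differ.
from neodragon import NeodragonPipeline  # assumed import path

pipe = NeodragonPipeline.from_pretrained("karnewar/Neodragon")  # assumed API
video = pipe(
    prompt="A red panda walking through a bamboo forest",
    num_frames=49,  # 2 s @ 24 fps
)
video.save("panda.mp4")  # assumed helper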

Model Description

  • Developed by: Qualcomm AI Research, Generative Vision group, Amsterdam, Netherlands
  • Model type: Mobile Video Generation with efficient pyramidal Diffusion Transformer
  • Model size: 4.945B parameters (full package)
  • Model precision: torch.bfloat16 (BF16)
  • Model resolution: This model generates 49-frame (2 s @ 24 fps) videos at 320×512 resolution directly on a Snapdragon-powered mobile phone; QuickSRNet then upscales them 2× to the final 640×1024.
  • Model Description: This model generates videos from the provided text prompts. It is a Diffusion Transformer that uses our finetuned TinyAEHV auto-encoder with 8×8×8 spatio-temporally compressed latent features (a worked example of the resulting latent shape follows this list).
  • Resources for more information: Check out our GitHub repository, the technical report on arXiv, and the ICLR 2026 OpenReview page.
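As a quick sanity check on the 8×8×8 compression mentioned above, the snippet below works out the latent grid for a 49-frame 320×512 clip. The causal handling of the first frame is our assumption, following common video auto-encoder designs.

# Worked example: latent shape under 8x8x8 spatio-temporal compression.
frames, height, width = 49, 320, 512
t_factor, s_factor = 8, 8

# Assumed causal first-frame handling: 1 + (49 - 1) / 8 latent frames.
latent_t = 1 + (frames - 1) // t_factor  # 7
latent_h = height // s_factor            # 40
latent_w = width // s_factor             # 64
print(latent_t, latent_h, latent_w)      # -> 7 40 64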

License/Terms of Use

This model is released under the terms and conditions laid out in the Qualcomm AI Hub Proprietary License.

Uses

The model is intended for research purposes. Possible research areas and tasks include:

  • Research on Efficient Transformer or non-Transformer based Backbone Architectures for Video Generation.
  • Generation of Image/Video based artworks and use in design and other artistic processes.
  • Applications in educational or creative tools.
  • Research on generative models.
  • Safe deployment of models which have the potential to generate harmful content.
  • Probing and understanding the limitations and biases of generative models.

Excluded uses are described below.

Limitations and Bias

Limitations

  • The model does not achieve perfect photorealism.
  • The model cannot render complex legible text.
  • The model cannot produce videos with physically accurate motion.

Bias

While the capabilities of the presented mobile video generation model are impressive, it can also reinforce or exacerbate social biases inherited from its foundation base model, Pyramidal-Flow.
