Lance 3B Video - GGUF
This repository contains the quantized GGUF weights for Lance 3B Video, originally developed by ByteDance Research.
Lance is a lightweight, native unified multimodal model that supports image and video understanding, generation, and editing within a single framework. Quantizing the weights to GGUF makes local inference much more accessible by substantially reducing the 40 GB VRAM requirement of the original unquantized model.
Hardware Compatibility & Files
These GGUF files are designed for efficient CPU/GPU offloading. Choose the quantization level that best fits your available memory and desired quality.
| Quantization | Filename | Size | Description |
|---|---|---|---|
| 4-bit | Lance_3B_Video-Q4_K_M.gguf | 4.96 GB | Optimal balance of speed and VRAM. Great for mid-range hardware. |
| 5-bit | Lance_3B_Video-Q5_K_M.gguf | 5.53 GB | Higher precision with minimal size increase. |
| 6-bit | Lance_3B_Video-Q6_K.gguf | 6.12 GB | Near-unquantized quality for higher-end local setups. |
| 8-bit | Lance_3B_Video-Q8_0.gguf | 7.62 GB | Maximum quality, requiring the most memory among these options. |
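As a rough aid for choosing a file, the sketch below picks the largest quantization from the table that fits a given memory budget. The file names and sizes come from the table; the `pick_quant` helper and its `headroom_gb` default are hypothetical, and the headroom is only a placeholder for activations, latents, and runtime overhead, which vary by backend and resolution.

```python
# File names and sizes (in GB) copied from the table above.
QUANT_FILES = [
    ("Lance_3B_Video-Q4_K_M.gguf", 4.96),
    ("Lance_3B_Video-Q5_K_M.gguf", 5.53),
    ("Lance_3B_Video-Q6_K.gguf", 6.12),
    ("Lance_3B_Video-Q8_0.gguf", 7.62),
]


def pick_quant(available_gb, headroom_gb=2.0):
    """Return the largest file that fits in the budget, or None if none fit.

    headroom_gb is an assumed allowance for everything besides the weights;
    it is not a measured figure for this model.
    """
    budget = available_gb - headroom_gb
    fitting = [(name, size) for name, size in QUANT_FILES if size <= budget]
    if not fitting:
        return None
    # Larger files keep more precision, so prefer the biggest one that fits.
    return max(fitting, key=lambda item: item[1])[0]


if __name__ == "__main__":
    for memory_gb in (6, 8, 12):
        print(memory_gb, "GB ->", pick_quant(memory_gb))
```

If nothing fits, the function returns `None`, which is the cue to rely on CPU offloading or a smaller budget for other components.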
Model Overview
Rather than relying on massive parameter scaling, Lance achieves state-of-the-art results through collaborative multi-task training and a dual-stream Mixture-of-Experts (MoE) architecture.
- Efficient at 3B Scale: With only 3 billion active parameters, Lance delivers robust performance across video generation, editing, and understanding benchmarks.
- Trained from Scratch: Built with a staged multi-task recipe and trained entirely from scratch within a highly efficient 128-A100-GPU budget.
- Unified Architecture: Uses a dual-stream architecture on shared interleaved multimodal sequences. It employs modality-aware rotary positional encoding to prevent interference between different types of visual tokens.
Supported Tasks for this Video Variant:
- Text-to-Video Generation: Generate highly detailed, physics-aware, and coherent videos from text prompts.
- Video Editing: Perform background transformation, object replacement, and style changes while maintaining temporal consistency.
- Multi-turn Consistency Editing: Chain multiple edits together on the same video seamlessly.
- Video Understanding: Analyze video content for QA and detailed captioning.
How to Use
(Note: GGUF support for multimodal diffusion/generation models is evolving rapidly. You will typically use these weights through custom nodes for UI frameworks rather than standard text-based LLM runners.)
Recommended Ecosystems:
- ComfyUI: Look for custom nodes that support GGUF unified multimodal or Lance architectures (similar to how Wan/Hunyuan/Flux GGUFs are currently implemented).
- Custom Inference Scripts: Can be loaded via specialized Python wrappers that utilize the ggml backend for multimodal pipelines.
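Before wiring a file into any loader, it can be useful to confirm that the download is an intact GGUF container. The sketch below reads only the fixed-size header using the Python standard library. It assumes a little-endian file in GGUF version 2 or later (where the two counts are 64-bit), and the `read_gguf_header` helper is a hypothetical name; it does not load or run the model.

```python
import struct


def read_gguf_header(path):
    """Read the fixed GGUF header: magic, version, tensor and metadata counts.

    Assumes little-endian GGUF v2 or later, where both counts are uint64.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"{path} is not a GGUF file (magic={magic!r})")
        raw = f.read(20)  # uint32 version + two uint64 counts
        if len(raw) != 20:
            raise ValueError(f"{path} is truncated")
        version, tensor_count, kv_count = struct.unpack("<IQQ", raw)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


if __name__ == "__main__":
    import sys

    # Usage: python check_gguf.py Lance_3B_Video-Q4_K_M.gguf
    print(read_gguf_header(sys.argv[1]))
```

A `ValueError` here usually points to an interrupted download or a file that is not GGUF at all, which is cheaper to catch at this stage than inside a ComfyUI node.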
Citation
If you use this model or the original architecture in your research, please cite the original ByteDance paper:
@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Wu, Shaojin and Guo, Jianzhu and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}