Lance 3B Video - GGUF

This repository contains the quantized GGUF weights for Lance 3B Video, originally developed by ByteDance Research.

Lance is a lightweight, native unified multimodal model that supports image and video understanding, generation, and editing within a single framework. Quantizing the weights to GGUF makes the model far more accessible for local inference by substantially reducing the 40GB VRAM requirement of the original unquantized model.

📊 Hardware Compatibility & Files

These GGUF files are designed for efficient CPU/GPU offloading. Choose the quantization level that best fits your available memory and desired quality.

| Quantization | Filename | Size | Description |
| --- | --- | --- | --- |
| 4-bit | `Lance_3B_Video-Q4_K_M.gguf` | 4.96 GB | Optimal balance of speed and VRAM. Great for mid-range hardware. |
| 5-bit | `Lance_3B_Video-Q5_K_M.gguf` | 5.53 GB | Higher precision with minimal size increase. |
| 6-bit | `Lance_3B_Video-Q6_K.gguf` | 6.12 GB | Near-unquantized quality for higher-end local setups. |
| 8-bit | `Lance_3B_Video-Q8_0.gguf` | 7.62 GB | Maximum quality, requiring the most memory among these options. |
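
As a rough aid for choosing a file, the sketch below picks the largest quantization whose file size fits a given memory budget. The file sizes are copied from the table above; the `pick_quant` helper and the 2 GB headroom default are illustrative assumptions, not measured requirements. Actual memory use depends on the runtime, resolution, and video length, so treat the file size as a lower bound.

```python
# File sizes in GB, copied from the table above.
QUANT_FILES = {
    "Lance_3B_Video-Q4_K_M.gguf": 4.96,
    "Lance_3B_Video-Q5_K_M.gguf": 5.53,
    "Lance_3B_Video-Q6_K.gguf": 6.12,
    "Lance_3B_Video-Q8_0.gguf": 7.62,
}


def pick_quant(available_gb, headroom_gb=2.0):
    """Return the largest file that fits in available_gb minus headroom_gb.

    headroom_gb is an assumed allowance for activations and runtime
    buffers; it is a placeholder value, not a measured figure.
    Returns None when no file fits.
    """
    budget = available_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_FILES.items() if size <= budget}
    if not fitting:
        return None
    # Larger files correspond to higher-precision quantizations.
    return max(fitting, key=fitting.get)


print(pick_quant(8.0))   # Lance_3B_Video-Q5_K_M.gguf
print(pick_quant(12.0))  # Lance_3B_Video-Q8_0.gguf
print(pick_quant(6.0))   # None
```

With CPU/GPU offloading, `available_gb` can be read as combined VRAM plus the system RAM you are willing to dedicate, at the cost of slower inference.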

🌟 Model Overview

Rather than relying on massive parameter scaling, Lance achieves state-of-the-art results through collaborative multi-task training and a dual-stream Mixture-of-Experts (MoE) architecture.

Efficient at 3B Scale: With only 3 billion active parameters, Lance delivers robust performance across video generation, editing, and understanding benchmarks.
Trained from Scratch: Built with a staged multi-task recipe and trained entirely from scratch within a highly efficient 128-A100-GPU budget.
Unified Architecture: Uses a dual-stream architecture on shared interleaved multimodal sequences. It employs modality-aware rotary positional encoding to prevent interference between different types of visual tokens.

Supported Tasks for this Video Variant:

Text-to-Video Generation: Generate highly detailed, physics-aware, and coherent videos from text prompts.
Video Editing: Perform background transformation, object replacement, and style changes while maintaining temporal consistency.
Multi-turn Consistency Editing: Chain multiple edits together on the same video seamlessly.
Video Understanding: Analyze video content for QA and detailed captioning.

🚀 How to Use

(Note: GGUF support for multimodal diffusion/generation models is rapidly evolving. You will typically use these weights inside custom nodes for UI frameworks rather than in standard text-based LLM runners.)

Recommended Ecosystems:

ComfyUI: Look for custom nodes that support GGUF unified multimodal or Lance architectures (similar to how Wan/Hunyuan/Flux GGUFs are currently implemented).
Custom Inference Scripts: The weights can be loaded via specialized Python wrappers that use the ggml backend for multimodal pipelines.
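
Before wiring a file into either ecosystem, it can help to confirm that the download is an intact GGUF file. The sketch below reads only the fixed-size GGUF header using the Python standard library. It assumes a little-endian file in GGUF version 2 or later (where the two counts are 64-bit integers); `read_gguf_header` is a hypothetical helper name and not part of any Lance or ggml tooling.

```python
import struct


def read_gguf_header(path):
    """Read the fixed-size GGUF header from the file at path.

    Assumed layout (little-endian, GGUF v2 or later):
      4 bytes  magic b"GGUF"
      uint32   format version
      uint64   number of tensors
      uint64   number of metadata key/value pairs
    """
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:24])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

For example, `read_gguf_header("Lance_3B_Video-Q4_K_M.gguf")` returns a small dictionary if the header parses and raises `ValueError` if the file is truncated or is not GGUF. It does not validate the tensors themselves or confirm that a given loader supports the Lance architecture.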

Citation

If you use this model or the original architecture in your research, please cite the original ByteDance paper:

@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Wu, Shaojin and Guo, Jianzhu and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}

