📂 Part of the Lance MLX collection on mlx-community.
Lance-3B-Video-bf16 (MLX, video specialist)
MLX port of ByteDance Intelligent Creation Lab's Lance — the video-specialist Lance_3B_Video checkpoint, converted to bf16 for Apple Silicon. ~6.44 B LLM parameters + 669 M Qwen2.5-VL ViT bundled, with the 126,976-entry latent_pos_embed table needed for video-scale latent grids.
Lance is ByteDance's 3B-active unified multimodal model (paper, code, HF original). This is not Lance/LanceDB, the columnar data format.
Status — 🚧 t2v port quality issue under investigation (2026-05-21)
Honest correction: an earlier version of this model card described Lance_3B_Video's painterly t2v aesthetic as "by design." Direct comparison against the Phase 0 PyTorch oracle (same weights, same prompt, same seed/scale at 768×768×50f) shows this was wrong — PyTorch Lance_3B_Video produces photorealistic 3D-cinematic output. Our MLX t2v port consistently produces softer painterly renderings instead.
This is a port-side bug (likely numerical or routing). It is being tracked and debugged in xocialize/lance-mlx issue #2. The leading candidate at the moment is the MaPE temporal anchor we apply (ANCHOR_VIDEO_GEN=2000), which upstream's shift_position_ids does not actually fire for pure t2v.
| Capability | Status | Notes |
|---|---|---|
| t2v at 256² × ≤25f | 🟡 End-to-end works, painterly | Subject + scene recognizable; quality below PyTorch reference |
| t2v at 768² × ≤25f | 🟡 End-to-end works, painterly | Same — content correct, fidelity blocked on port fix |
| t2v at oracle scale (768²×50f) | ⏳ Not yet measured at oracle scale in MLX | Definitive test pending |
| x2t_video (video VQA / captioning) | ✅ Validated against Phase 0 oracle | Unaffected by the t2v bug: ViT + UND-tower path only |
| video_edit (instruction-based) | 🚧 Inherits t2v quality issue | End-to-end runs; cinematic quality awaits t2v fix |
For production-quality image tasks (t2i, image_edit, x2t_image), use mlx-community/Lance-3B-bf16 (or mlx-community/Lance-3B-8bit for 16 GB Macs). Those reproduce the PyTorch reference quality.
| Capability | Status | Notes |
|---|---|---|
| t2v at 256×256 × 16f | ✅ Works | ~33 s/clip on M5 Max. |
| t2v at 512×512 × 16f | ✅ Works | ~60 s/clip. |
| t2v at 768×768 × 13f (n_lat=9.2k) | ✅ Works | ~2.5 min/clip. Recognizable subjects (red panda with cap → "dog with hat"). |
| t2v at 768×768 × 17f (n_lat=11.5k) | ✅ Works | ~20 min/clip. "Five balls on a wooden table" → recognizable balls on wood texture, varied colors. |
| t2v at 768×768 × 25f (n_lat=16.1k) | 🟡 Validated | See commit notes. |
| t2v at 768×768 × 49f (n_lat=30k) | ⚠️ Functional but slow | ~2¼ h/clip on M5 Max. Memory and time become impractical for casual use. |
| x2t_video (video VQA / captioning) | ✅ Validated against Phase 0 oracle | Cooking-video VQA produces a content-correct 256-token caption (kitchen + pan + spatula + tomato + meat + stirring all matched) in 17.5 s. |
| video_edit (instruction-based) | ✅ Functional | "Change all the balls to a deep red color." → balls recolored, composition preserved. 17 frames × 256² in 81.6 s. |
For production-quality photorealistic image tasks (t2i, image_edit, x2t_image), use the sibling repo mlx-community/Lance-3B-bf16 — Lance_3B is the image specialist with crystal aesthetic.
Why "painterly" is the wrong framing
Earlier we believed Lance_3B_Video's painterly aesthetic was a deliberate fine-tune choice. The Phase 4c per-tensor diff (_moe_gen QK-norms differ by 0.5–0.85 in 6+ layers; lm_head and embed_tokens byte-identical with Lance_3B) is real — but the conclusion we drew from it was wrong. The Phase 0 PyTorch oracle, generated from these exact weights at validation_timestep_shift 3.5, cfg_text_scale 4.0, validation_num_timesteps 30, produces clean photorealistic / 3D-cinematic output. Same model, same config — different aesthetic. That means the painterly look we see is our MLX port doing something wrong, not the model expressing an intentional style.
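A per-tensor comparison of the kind mentioned above can be sketched as follows. This is an illustrative sketch, not the Phase 4c script: the helper name `per_tensor_max_abs_diff`, the max-absolute-difference metric, and the toy tensor names are all assumptions, and real checkpoints would first have to be loaded into name-to-array dicts.

```python
import numpy as np

def per_tensor_max_abs_diff(a, b):
    """Return {name: max |a - b|} for tensors present in both dicts with equal shapes."""
    report = {}
    for name in sorted(a.keys() & b.keys()):
        x = np.asarray(a[name], dtype=np.float32)
        y = np.asarray(b[name], dtype=np.float32)
        if x.shape != y.shape:
            continue  # shape mismatches would need separate handling
        report[name] = float(np.max(np.abs(x - y)))
    return report

# Toy stand-ins for two checkpoints (hypothetical tensor names and values).
image_ckpt = {
    "lm_head.weight": np.ones((4, 4)),
    "layers.0.q_norm_moe_gen.weight": np.full(4, 1.0),
}
video_ckpt = {
    "lm_head.weight": np.ones((4, 4)),
    "layers.0.q_norm_moe_gen.weight": np.full(4, 1.5),
}

diff = per_tensor_max_abs_diff(image_ckpt, video_ckpt)
changed = [name for name, d in diff.items() if d > 0]
print(diff)
print("changed tensors:", changed)
```

A diff like this only shows where two checkpoints differ. As the oracle comparison above demonstrates, it says nothing on its own about what output aesthetic those differences produce.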
Why a separate "Video" checkpoint?
ByteDance ships two variants of Lance that differ in fine-tuning:
- Lance_3B: image specialist. Crystal-clear photorealistic t2i.
- Lance_3B_Video: video specialist. Same architecture, further fine-tuned on video data. Bundles the Qwen2.5-VL ViT (669 M) and the larger 126,976-entry latent_pos_embed table that addresses video-resolution token grids.
Quickstart
Install from the lance-mlx source repo:
```shell
git clone https://github.com/xocialize/lance-mlx
cd lance-mlx && uv sync
```
Download this checkpoint:
```python
from huggingface_hub import snapshot_download

weights = snapshot_download("mlx-community/Lance-3B-Video-bf16")
```
Text-to-video
```python
from lance_mlx.pipeline.t2v import TextToVideoPipeline

pipe = TextToVideoPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
frames = pipe.generate(
    "Five balls on a wooden table: two blue, three green.",
    num_frames=17, height=768, width=768,
    num_steps=30, cfg_scale=4.0, seed=42,
)
# frames is np.ndarray of shape (T_decoded, H, W, 3) uint8
```
Encode to MP4 with imageio:
```python
import imageio

with imageio.get_writer("out.mp4", fps=12, codec="libx264") as writer:
    for f in frames:
        writer.append_data(f)
```
Video understanding
```python
from lance_mlx.pipeline.understanding import UnderstandingPipeline

pipe = UnderstandingPipeline.from_pretrained(
    lance_weights_dir=weights,
    vit_safetensors=f"{weights}/vit.safetensors",
)
answer = pipe.generate_video(
    video="my_video.mp4",
    question="Describe what happens in this video.",
    num_sample_frames=16, target_h=224, target_w=224,
    max_new_tokens=256, prompt_style="lance",
)
print(answer)
```
Validated content-correct against the Phase 0 oracle's cooking VQA case (kitchen + pan + spatula + tomato + meat + stirring matched).
Video editing
```python
from lance_mlx.pipeline.video_edit import VideoEditPipeline

pipe = VideoEditPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
frames = pipe.generate(
    input_video="my_video.mp4",
    instruction="Change all the balls to a deep red color.",
    height=256, width=256, num_frames=17,
    num_steps=30, cfg_scale=4.0, seed=42,
)
```
Performance (M5 Max 128 GB)
| Task | Configuration | Wall-clock |
|---|---|---|
| t2v | 256² × 16f, 30 steps, CFG=4.0 | ~33 s |
| t2v | 512² × 16f, 30 steps, CFG=4.0 | ~60 s |
| t2v | 768² × 13f, 30 steps, CFG=4.0 | ~145 s |
| t2v | 768² × 17f, 30 steps, CFG=4.0 | ~20 min |
| t2v | 768² × 49f, 30 steps, CFG=4.0 | ~2¼ hours (impractical) |
CFG doubles the forward cost, since the cond and uncond passes run sequentially. Attention scales as O(N²) in latent-token count, so high-frame, high-resolution combinations quickly become impractical. A KV cache for the text prefix is a Phase 5 follow-up.
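The n_lat figures in the status table are consistent with the Wan2.2 VAE compression listed under Architecture (16× spatial, 4× temporal). The sketch below reproduces them under two assumptions: the first frame gets its own latent frame, and each latent position is one token. The helper name `latent_token_count` is hypothetical and is not part of lance_mlx.

```python
def latent_token_count(num_frames, height, width, spatial=16, temporal=4):
    """Latent tokens for a clip, assuming a causal VAE: (T - 1) // 4 + 1 latent frames."""
    latent_frames = (num_frames - 1) // temporal + 1
    return latent_frames * (height // spatial) * (width // spatial)

baseline = latent_token_count(13, 768, 768)
for frames in (13, 17, 25, 49):
    n = latent_token_count(frames, 768, 768)
    # Quadratic attention cost relative to the 13-frame clip (attention only).
    relative_attention = (n / baseline) ** 2
    print(f"768x768 x {frames}f: n_lat={n}, attention cost x{relative_attention:.2f}")
```

The relative figure covers the attention term only. The measured wall-clock times in the table grow faster than this ratio, so other factors also contribute.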
Files in this repo
| File | Size | Purpose |
|---|---|---|
| model.safetensors | 12.87 GB | LLM weights (1021 tensors, both UND + GEN towers, with 126,976-entry latent_pos_embed) |
| vit.safetensors | 1.34 GB | Qwen2.5-VL ViT (semantic encoder for x2t_video) |
| vae.safetensors | 1.41 GB | Lance's bundled Wan2.2 VAE (also available standalone as mlx-community/Wan2.2-VAE-Lance-bf16) |
| config.json | – | Qwen2_5_VLForConditionalGeneration config |
| conversion_report.json | – | Provenance |
| tokenizer.json / vocab.json | – | Qwen2.5-VL vocabulary |
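The two largest file sizes follow from the parameter counts given under Provenance at 2 bytes per bf16 value. A quick check, assuming decimal gigabytes (1 GB = 10⁹ bytes) and ignoring the small safetensors header:

```python
BYTES_PER_BF16 = 2

def bf16_size_gb(num_params):
    """Approximate on-disk size in decimal GB for bf16 weights."""
    return num_params * BYTES_PER_BF16 / 1e9

llm_gb = bf16_size_gb(6.437e9)  # LLM parameters (UND + GEN towers)
vit_gb = bf16_size_gb(0.669e9)  # Qwen2.5-VL ViT parameters
print(f"model.safetensors ~ {llm_gb:.2f} GB, vit.safetensors ~ {vit_gb:.2f} GB")
```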
Provenance
Source: bytedance-research/Lance/Lance_3B_Video/model.safetensors (1411 tensors including bundled ViT; 6.437 B LLM + 0.669 B ViT params).
Converted via scripts/02_convert.py. The bundled ViT is extracted to a sibling vit.safetensors with the vit_model. prefix stripped, matching the layout convention of the image-specialist repo.
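The ViT extraction step can be pictured as a split of the tensor dict on the vit_model. prefix. The following is a minimal sketch of that idea, not the contents of scripts/02_convert.py: the function name `split_by_prefix` and the example tensor names are hypothetical.

```python
def split_by_prefix(tensors, prefix="vit_model."):
    """Split a name->tensor dict into (llm, vit), stripping the prefix from ViT names."""
    llm, vit = {}, {}
    for name, value in tensors.items():
        if name.startswith(prefix):
            vit[name[len(prefix):]] = value
        else:
            llm[name] = value
    return llm, vit

# Toy checkpoint with placeholder values instead of real arrays.
checkpoint = {
    "vit_model.blocks.0.attn.qkv.weight": "vit-tensor",
    "model.layers.0.self_attn.q_proj.weight": "llm-tensor",
    "latent_pos_embed": "pos-table",
}
llm_tensors, vit_tensors = split_by_prefix(checkpoint)
print(sorted(llm_tensors))
print(sorted(vit_tensors))
```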
Tips
- Use concrete-subject prompts. "Five red apples in a bowl" works better than "the joy of friendship in motion." The model can render abstract scenes, but the painterly aesthetic on already-abstract subjects can read as overly abstract.
- Smaller scales iterate faster. 256² × 16 frames is the fastest test config (~33 s); good for prompt iteration. Scale up once you find a prompt you like.
- English + Chinese prompts work. Other languages are out of distribution (Qwen2.5-VL was trained primarily on en + zh).
Limitations
- bf16 only. 4-bit + 8-bit quantization in progress (Phase 5b). Naive INT4 has been observed to degrade the GEN expert per Reza2kn/lance-quant's findings; quantization needs per-tower calibration.
- No streaming or batched generation.
- CFG doubles forward cost. A future KV-cache for the text + clean-ref prefix would save ~30% per step.
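To see why CFG doubles the forward cost, here is a sketch of the standard classifier-free guidance combination applied to a velocity prediction. It assumes the usual formulation, v = v_uncond + scale × (v_cond − v_uncond). Whether the port combines the two passes in exactly this way is not confirmed here, and `toy_velocity` is a stand-in for the real model.

```python
import numpy as np

forward_calls = 0

def toy_velocity(latents, conditioned):
    """Stand-in for one model forward pass; counts how often it is called."""
    global forward_calls
    forward_calls += 1
    return latents * (2.0 if conditioned else 1.0)

def cfg_velocity(latents, cfg_scale):
    # Two sequential forward passes per denoising step: conditional, then unconditional.
    v_cond = toy_velocity(latents, conditioned=True)
    v_uncond = toy_velocity(latents, conditioned=False)
    return v_uncond + cfg_scale * (v_cond - v_uncond)

x = np.ones((2, 3), dtype=np.float32)
v = cfg_velocity(x, cfg_scale=4.0)
print("forward passes for one step:", forward_calls)
```

With 30 steps this means 60 forward passes per clip, which is why caching any part of the sequence that is shared between passes is attractive.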
Architecture (shared with the image specialist)
- Two expert towers (LLM_UND, LLM_GEN), each initialized from Qwen2.5-VL-3B-Instruct, with per-expert FFN, output projection, and QK-norm.
- Modality-deterministic routing: text + Qwen2.5-VL ViT semantic tokens → LLM_UND (autoregressive); Wan2.2 VAE latent tokens → LLM_GEN (flow-matching velocity prediction). No learned gate.
- MaPE: modality-aware RoPE with per-modality temporal anchor.
- Wan2.2 3D causal VAE (16× spatial / 4× temporal compression, 48-channel latent).
- Bidirectional attention within latent block.
- Untied LM head.
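The routing rule is a fixed lookup on token modality, with no gate network. A minimal sketch of that rule follows. The `Modality` enum and `route` helper are hypothetical names for illustration and do not mirror the lance_mlx code.

```python
from enum import Enum

class Modality(Enum):
    TEXT = "text"
    VIT = "vit"                # Qwen2.5-VL ViT semantic tokens
    VAE_LATENT = "vae_latent"  # Wan2.2 VAE latent tokens

# Fixed modality -> tower table; nothing here is learned.
ROUTES = {
    Modality.TEXT: "LLM_UND",
    Modality.VIT: "LLM_UND",
    Modality.VAE_LATENT: "LLM_GEN",
}

def route(modalities):
    """Return the tower name that handles each token in the sequence."""
    return [ROUTES[m] for m in modalities]

sequence = [Modality.TEXT, Modality.TEXT, Modality.VIT,
            Modality.VAE_LATENT, Modality.VAE_LATENT]
towers = route(sequence)
print(towers)
```

Because the table is fixed, a task such as x2t_video, which involves only text and ViT tokens, never touches LLM_GEN.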
License
This MLX port: Apache 2.0.
Underlying weights:
- Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
- Wan2.2 VAE: Apache 2.0 (Alibaba).
- Qwen2.5-VL: Apache 2.0 (Alibaba).
See NOTICE for attribution.
Citation
@article{fu2026lance,
title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
journal={arXiv preprint arXiv:2605.18678},
year={2026}
}
Links
- MLX port code + phase notes: github.com/xocialize/lance-mlx
- Original PyTorch model: bytedance-research/Lance
- Image specialist (production): mlx-community/Lance-3B-bf16
- Wan2.2 VAE (standalone): mlx-community/Wan2.2-VAE-Lance-bf16