How to use mlx-community/Lance-3B-bf16 with MLX:
# Download the model from the Hub
pip install "huggingface_hub[hf_xet]"
huggingface-cli download --local-dir Lance-3B-bf16 mlx-community/Lance-3B-bf16
📂 Part of the Lance MLX collection on mlx-community.
Lance-3B-bf16 (MLX, image specialist)
MLX port of ByteDance Intelligent Creation Lab's Lance unified multimodal model — the image-specialist Lance_3B checkpoint, converted to bf16 for Apple Silicon. 6.19 B LLM parameters in MoT (Mixture-of-Transformer-Experts) layout, plus the Qwen2.5-VL ViT (669 M) and Lance's bundled Wan2.2 VAE (~705 M) for full image-task coverage.
Lance is ByteDance's 3B-active unified multimodal model (paper, code, HF original). This is not Lance/LanceDB, the columnar data format.
Status
🟢 Production-ready for image tasks on Apple Silicon as of 2026-05-21.
| Capability | Status |
|---|---|
| t2i (text → image) | ✅ Photorealistic, prompt-aligned. 768² output at ~6.7 s/step. |
| image_edit (instruction-based editing) | ✅ Identity + style + signature preservation verified. ~6.7 s/step. |
| x2t_image (image understanding / VQA) | ✅ Content-correct across all 6 oracle cases. |
| KV cache for autoregressive decode | ✅ 1.7×–2.8× speedup over no-cache baseline. |
For video tasks (t2v, video_edit, x2t_video), see mlx-community/Lance-3B-Video-bf16. All six Lance task families are now validated end-to-end on Apple Silicon as of 2026-05-21.
The 48-channel Wan2.2 VAE is bundled here for convenience but also published standalone at mlx-community/Wan2.2-VAE-Lance-bf16 — both image_edit and the video pipelines need it.
Quickstart
Install from the source repo (a PyPI release is planned as a follow-up):
git clone https://github.com/xocialize/lance-mlx
cd lance-mlx && uv sync
Download this checkpoint:
from huggingface_hub import snapshot_download
weights = snapshot_download("mlx-community/Lance-3B-bf16")
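Before building a pipeline, it can help to confirm that the snapshot contains the files the pipelines load. The sketch below is not part of lance-mlx: `missing_files` is a hypothetical helper, and the file names are taken from the "Files in this repo" table.

```python
from pathlib import Path

# File names as listed in the "Files in this repo" table of this card.
EXPECTED_FILES = (
    "model.safetensors",
    "vit.safetensors",
    "vae.safetensors",
    "config.json",
)


def missing_files(weights_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from weights_dir."""
    root = Path(weights_dir)
    return [name for name in expected if not (root / name).is_file()]
```

Calling `missing_files(weights)` on a complete download should return an empty list; any names it returns point at an interrupted or partial snapshot.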
Text-to-image
from lance_mlx.pipeline.t2i import TextToImagePipeline
pipe = TextToImagePipeline.from_pretrained(
lance_weights_dir=weights,
vae_safetensors=f"{weights}/vae.safetensors",
)
image = pipe.generate(
"A photorealistic tabby cat holding up a colorful STOP sign on a sunlit street.",
height=768, width=768, num_steps=30, cfg_scale=4.0, seed=42,
)
image.save("cat_with_stop.png")
Image editing
from lance_mlx.pipeline.image_edit import ImageEditPipeline
pipe = ImageEditPipeline.from_pretrained(
lance_weights_dir=weights,
vae_safetensors=f"{weights}/vae.safetensors",
)
edited = pipe.generate(
input_image="portrait.jpg",
instruction="Remove the hat from the painting.",
height=768, width=768, num_steps=30, cfg_scale=4.0, seed=42,
)
edited.save("portrait_no_hat.png")
Image VQA / understanding
from lance_mlx.pipeline.understanding import UnderstandingPipeline
from PIL import Image
pipe = UnderstandingPipeline.from_pretrained(
lance_weights_dir=weights,
vit_safetensors=f"{weights}/vit.safetensors",
)
answer = pipe.generate(
Image.open("license_plate.png"),
"What is the license plate number visible in this image?",
max_new_tokens=64, prompt_style="lance",
)
print(answer)
Performance (M5 Max 128 GB, macOS 26.2, MLX bf16)
| Task | Configuration | Wall-clock |
|---|---|---|
| t2i | 768² × 30 steps × CFG=4.0 | ~201 s |
| image_edit | 768² × 30 steps × CFG=4.0 | ~201 s |
| x2t_image | 6 oracle cases (5–100 token answers), KV-cached | ~34 s combined |
KV cache scales with answer length: 1.7× speedup on a 5-token answer, 2.8× on a ~100-token answer.
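The wall-clock figures for t2i and image_edit follow from the per-step latency in the Status table. A rough estimate, assuming the denoising loop dominates and ignoring model load, text encoding, and VAE decode (`estimated_wall_clock_s` is an illustrative helper, not a lance-mlx function):

```python
def estimated_wall_clock_s(num_steps, seconds_per_step):
    """Estimate denoising time as steps x per-step latency (overheads ignored)."""
    return num_steps * seconds_per_step


# 30 steps at the ~6.7 s/step reported for 768x768 on the M5 Max.
print(f"{estimated_wall_clock_s(30, 6.7):.0f} s")  # prints "201 s"
```

The same helper gives a first guess for other step counts, although per-step latency on different hardware or resolutions has to be measured rather than assumed.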
Architecture
- Two expert towers (LLM_UND, LLM_GEN), each initialized from Qwen2.5-VL-3B-Instruct, with per-expert FFN, output projection, and QK-norm.
- Modality-deterministic routing: text + Qwen2.5-VL ViT semantic tokens → LLM_UND (autoregressive next-token); Wan2.2 VAE latent tokens → LLM_GEN (flow-matching velocity prediction). No learned gate.
- MaPE: modality-aware RoPE with per-modality temporal anchor (image-gen tokens re-anchored to t=1000).
- Wan2.2 3D causal VAE (16× spatial / 4× temporal compression, 48-channel latent). Lance bundles its own VAE; do NOT use the public 16-ch wan2.2_vae.safetensors.
- Bidirectional attention within latent block: causal_mask OR full_and_noise_mask, per upstream data/data_utils.py::create_sparse_mask. Without this, the noisy-VAE position 0 of a 2304-token image grid can only see itself + text, producing blurry outputs across all prompts.
- Untied LM head.
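The "causal OR full" attention rule can be illustrated with a small NumPy sketch. This is not the upstream `create_sparse_mask` implementation. It assumes a simplified layout (text tokens first, then a single block of latent tokens) and one token per latent position, and `build_attention_mask` is a hypothetical name.

```python
import numpy as np


def build_attention_mask(num_text, num_latent):
    """Boolean mask where mask[q, k] is True if query q may attend to key k.

    Assumed layout: text tokens first, then one contiguous latent block.
    """
    total = num_text + num_latent
    # Causal part: every token sees itself and everything before it.
    causal = np.tril(np.ones((total, total), dtype=bool))
    # Full part: latent tokens see every other latent token in the block.
    in_latent = np.arange(total) >= num_text
    full_latent = in_latent[:, None] & in_latent[None, :]
    return causal | full_latent


# 768 px / 16x spatial compression = 48 latents per side, 48 * 48 = 2304 tokens.
latents_per_side = 768 // 16
mask = build_attention_mask(num_text=8, num_latent=latents_per_side ** 2)
```

With only the causal term, the first latent row would be False for every later latent column, which is the "position 0 sees only itself + text" failure described in the list above.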
Files in this repo
| File | Size | Purpose |
|---|---|---|
| model.safetensors | 12.37 GB | LLM weights (1021 tensors, both UND + GEN towers) |
| vit.safetensors | 1.34 GB | Qwen2.5-VL ViT (semantic encoder for x2t_image + image_edit) |
| vae.safetensors | 1.41 GB | Lance's bundled Wan2.2 VAE (encoder + decoder, 48-ch) |
| config.json | – | Qwen2_5_VLForConditionalGeneration config with tie_word_embeddings=false |
| conversion_report.json | – | Provenance of safetensors conversion (PyTorch → MLX bf16) |
| tokenizer.json / vocab.json | – | Qwen2.5-VL vocabulary (151,936 tokens) |
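The file sizes line up with the parameter counts quoted elsewhere in this card at two bytes per bf16 value. A quick sanity check, assuming GB means 10^9 bytes and ignoring the safetensors header and the few tensors kept in F32 (`bf16_size_gb` is an illustrative helper):

```python
BYTES_PER_BF16 = 2


def bf16_size_gb(num_params):
    """Approximate on-disk size in GB (10^9 bytes) for bf16 weights."""
    return num_params * BYTES_PER_BF16 / 1e9


# Parameter counts as quoted in this card (the VAE figure is approximate).
for name, params in [("LLM", 6.185e9), ("ViT", 669e6), ("VAE", 705e6)]:
    print(f"{name}: {bf16_size_gb(params):.2f} GB")
```

A downloaded file that is much smaller than these estimates is more likely a truncated transfer than a different precision.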
Provenance
Source: bytedance-research/Lance/Lance_3B/model.safetensors (1021 tensors, 6.185 B params).
Conversion script: scripts/02_convert.py in the lance-mlx repo. The script:
- Loads original PyTorch safetensors, keeps F32 for normalization scales (per Phase 1b notes).
- Strips the language_model. prefix; the MLX LanceModel is the root, not nested.
- Splits the bundled ViT (vit_model.* keys) into a sibling vit.safetensors for parity with the Lance_3B distribution shape.
- Re-keys llm2vae.weight/bias and time_embedder.mlp.{0,2}.{weight,bias} to match scaffolded MLX modules.
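The prefix-stripping and ViT-splitting steps amount to simple key manipulation on a flat state dict. The sketch below is not the code in scripts/02_convert.py: `split_checkpoint_keys` is a hypothetical function, it leaves ViT keys unchanged (the card does not say whether their prefix is kept), and it does not cover the F32 handling or the llm2vae / time_embedder re-keying.

```python
def split_checkpoint_keys(state, llm_prefix="language_model.", vit_prefix="vit_model."):
    """Split a flat {key: tensor} mapping into LLM and ViT dictionaries.

    ViT keys go to their own dict unchanged; LLM keys lose llm_prefix.
    Keys with neither prefix stay in the LLM dict as-is.
    """
    llm, vit = {}, {}
    for key, tensor in state.items():
        if key.startswith(vit_prefix):
            vit[key] = tensor
        elif key.startswith(llm_prefix):
            llm[key[len(llm_prefix):]] = tensor
        else:
            llm[key] = tensor
    return llm, vit
```

Each returned dictionary could then be written to its own safetensors file, matching the model.safetensors / vit.safetensors split in this repo.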
Wan2.2 VAE source: bytedance-research/Lance/Wan2.2_VAE.pth → scripts/06_convert_wan_vae.py. Roundtrip MAD on a real photo at 768² is ~7/255 in u8 domain.
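A roundtrip figure like this can be reproduced by encoding and decoding an image and comparing it with the original. The sketch assumes "MAD" means mean absolute difference over all pixels and channels, which the card does not spell out; `mean_abs_diff_u8` is an illustrative helper.

```python
import numpy as np


def mean_abs_diff_u8(original, reconstructed):
    """Mean absolute per-channel difference between two uint8 images (0..255 units)."""
    # Widen to int16 first so the subtraction cannot wrap around in uint8.
    a = np.asarray(original, dtype=np.int16)
    b = np.asarray(reconstructed, dtype=np.int16)
    if a.shape != b.shape:
        raise ValueError("images must have the same shape")
    return float(np.abs(a - b).mean())
```

Both inputs are expected as uint8 arrays of identical shape, for example `np.asarray(Image.open(path).convert("RGB"))` for each image.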
Limitations
- bf16 only. 4-bit and 8-bit quantization are in progress. Naive INT4 has been observed to degrade the GEN expert (per Reza2kn/lance-quant's findings), so quantization needs per-tower calibration.
- English + Chinese prompts work; other languages are training-distribution-limited (Qwen2.5-VL was trained primarily on en + zh).
- No streaming / batching API yet. Single-image, single-prompt generation only.
- CFG runs the LLM twice per step. A future KV-cache for the text + clean-ref prefix would save ~30% on image_edit.
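The two forward passes per step come from classifier-free guidance, which needs one conditional and one unconditional prediction. The sketch shows the standard CFG combination; whether lance-mlx combines the two velocity predictions in exactly this form is an assumption, and `cfg_combine` is an illustrative name.

```python
import numpy as np


def cfg_combine(v_cond, v_uncond, cfg_scale):
    """Standard classifier-free guidance: extrapolate away from the unconditional prediction."""
    v_cond = np.asarray(v_cond, dtype=np.float32)
    v_uncond = np.asarray(v_uncond, dtype=np.float32)
    return v_uncond + cfg_scale * (v_cond - v_uncond)
```

With `cfg_scale=1.0` the result equals the conditional prediction, and larger values such as the 4.0 used in the Quickstart push further along the conditional direction.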
Documented divergences from upstream PyTorch
- Outputs differ in low-level pixel detail from a CUDA reference run on the same seed/prompt (~1–5% per-pixel deviation expected from bf16 vs fp32, MLX RoPE vs PyTorch RoPE rounding, and a small number of intermediate-norm precision steps). Semantic correctness preserved across the 6 x2t_image oracle cases and all visually-verified t2i + image_edit prompts.
- x2t_image answers differ stylistically from Phase 0 oracle (PyTorch) — consistent across all 6 cases. Tracked as a Phase 5 parity follow-up; does not affect content correctness.
License
This MLX port: Apache 2.0.
Underlying weights:
- Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
- Wan2.2 VAE: Apache 2.0 (Alibaba).
- Qwen2.5-VL: Apache 2.0 (Alibaba).
See NOTICE for full attribution.
Citation
@article{fu2026lance,
title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
journal={arXiv preprint arXiv:2605.18678},
year={2026}
}
Links
- MLX port code + phase notes: github.com/xocialize/lance-mlx
- Original PyTorch model: bytedance-research/Lance
- Wan2.2 VAE (standalone): mlx-community/Wan2.2-VAE-Lance-bf16
- Video specialist (alpha): mlx-community/Lance-3B-Video-bf16