Initial release: Lance-3B-Video-MLX weights; T2I working, T2V pending VAE streaming cache
- README.md +129 -55
- config.json +6 -5
README.md
CHANGED
@@ -3,97 +3,171 @@ license: apache-2.0
base_model:
- bytedance-research/Lance
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag:
library_name: mlx
tags:
- multimodal
- mlx
- apple-silicon
- image-generation
- video-generation
- port
---

# Lance-3B-Video-MLX

## Files

```python
import mlx.core as mx
import mlx.nn as nn
from lance_mlx.lance import Lance, LanceConfig
from
```

```bibtex
@article{lance2026,
  title   = {Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author  = {Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and Jiang, Yunsheng and Huo, Yufei and Guo, Jianzhu and others},
  journal = {arXiv preprint arXiv:2605.18678},
  year    = {2026},
  url     = {http://arxiv.org/abs/2605.18678}
}
```

## License

Apache

- ByteDance Research

base_model:
- bytedance-research/Lance
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: text-to-video
library_name: mlx
tags:
- multimodal
- mlx
- apple-silicon
- text-to-image
- image-generation
- video-generation
- diffusion
- flow-matching
- moe
- qwen2_5_vl
- wan
- port
---

# Lance-3B-Video-MLX

Video variant of [Lance-3B-MLX](https://huggingface.co/RockTalk/Lance-3B-MLX). First native [MLX](https://github.com/ml-explore/mlx) port of [ByteDance Research's Lance](https://huggingface.co/bytedance-research/Lance), a 3B-parameter unified multimodal model for image/video generation, editing, and understanding. Runs natively on Apple Silicon; no CUDA required.

The architecture is **Qwen2.5-VL-3B + parallel MoE-gen experts + Wan 2.2 VAE**. Lance uses "Mixture-of-Tokens" routing: every attention block and MLP has a parallel `*_moe_gen` branch. Text tokens go through the normal weights; VAE-latent (generation) tokens go through the `_moe_gen` weights, in the same forward pass.

## What works

| Capability | Status |
|---|---|
| Text-to-image (T2I), single image, CFG | ✅ Working, verified |
| Strict load of all 1021 LLM/adapter tensors | ✅ Working |
| Wan 2.2 VAE encode/decode (T=1) | ✅ Working (uses [RockTalk/Wan2.2-VAE-MLX](https://huggingface.co/RockTalk/Wan2.2-VAE-MLX)) |
| Flow-matching denoising loop | ✅ Working |
| Classifier-free guidance | ✅ Working |
| 3D mrope position embeddings | ✅ Working |
| MoE-gen routing (per-token attention + MLP + layernorm) | ✅ Working |
| Text-to-video (T2V) | ⚠️ Weights ready (31 latent frames × 64² positional grid); needs Wan 2.2 VAE T>1 streaming cache to materialize video frames end-to-end |
| Image/video editing (TI2I, TIV2V) | ⏳ Phase 2: needs ViT integration |
| X→T (image/video understanding) | ⏳ Phase 2: needs AR sampling loop + KV cache |

## Sample generations

Verified on M4 Studio (128 GB). 30 steps, CFG=4, 512×512:

| Prompt | Output |
|---|---|
| *"a photo of a sunset over mountains"* | (image) |
| *"a fluffy orange cat sitting on a wooden chair, photorealistic"* | (image) |
| *"a majestic snowy mountain peak with a dramatic blue sky and clouds"* | (image) |

## Performance

Measured on M4 Studio (128 GB) at CFG=4 (one conditional + one unconditional forward per step):

| Resolution | Steps | Per-step | Total sample | VAE decode |
|---|---|---|---|---|
| 256×256 | 24 | ~400 ms | ~9.6 s | ~0.1 s |
| 512×512 | 30 | ~1.2 s | ~36 s | ~0.5 s |

First-call kernel-compile penalty: a few seconds per new resolution.

## Differences vs Lance-3B-MLX

This is the same architecture as the image variant, with two differences:
- `model.safetensors`: 26.5 GB (vs 23 GB), with extra weights for multi-frame attention
- `latent_pos_embed.pos_embed`: 31 × 64 × 64 = 126,976 positions (vs 1 × 64 × 64 = 4,096), which supports up to 31 latent frames (≈ 121 video frames at 4× temporal downsampling)

T2I via this checkpoint works the same as with Lance-3B-MLX. T2V will work once the Wan 2.2 VAE temporal streaming cache is implemented (v0.1.0 of [RockTalk/Wan2.2-VAE-MLX](https://huggingface.co/RockTalk/Wan2.2-VAE-MLX)).
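
The frame and position counts above follow from simple arithmetic, sketched below. The constants come from this commit's `config.json`. The temporal mapping assumes the causal layout commonly used by Wan-style VAEs (the first video frame gets its own latent frame, and each further latent frame covers 4 video frames); the helper names are hypothetical and are not part of `lance-mlx`.

```python
# Hypothetical helpers; constants taken from config.json in this commit.
VAE_DOWNSAMPLE_SPATIAL = 16
VAE_DOWNSAMPLE_TEMPORAL = 4
MAX_LATENT_SIZE = 64
MAX_NUM_LATENT_FRAMES = 31


def latent_shape(num_frames, height, width):
    """Map pixel-space (frames, H, W) to latent (T_lat, H_lat, W_lat).

    Assumption: causal temporal layout, T_lat = 1 + (frames - 1) // 4.
    """
    if height % VAE_DOWNSAMPLE_SPATIAL or width % VAE_DOWNSAMPLE_SPATIAL:
        raise ValueError("height and width must be multiples of 16")
    t_lat = 1 + (num_frames - 1) // VAE_DOWNSAMPLE_TEMPORAL
    return (t_lat,
            height // VAE_DOWNSAMPLE_SPATIAL,
            width // VAE_DOWNSAMPLE_SPATIAL)


def video_frames(t_lat):
    """Inverse of the temporal mapping, under the same assumption."""
    return 1 + (t_lat - 1) * VAE_DOWNSAMPLE_TEMPORAL


def num_positions(t_lat=MAX_NUM_LATENT_FRAMES, size=MAX_LATENT_SIZE):
    """Number of rows in a (T, H, W) positional-embedding table."""
    return t_lat * size * size
```

Under these assumptions a single 512×512 image maps to the `(1, 32, 32)` latent shape used in the usage snippet, and 31 latent frames correspond to 121 video frames.
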

## Files

| File | Size | Description |
|---|---|---|
| `model.safetensors` | 26.5 GB | LLM (Qwen2.5-VL with MoE-gen) + Lance adapters, 1021 tensors |
| `vit.safetensors` | 1.25 GB | Qwen2.5-VL ViT (for understanding mode, Phase 2) |
| `vae.safetensors` | 2.62 GB | Wan 2.2 VAE (older keying, kept for compatibility; the standalone [RockTalk/Wan2.2-VAE-MLX](https://huggingface.co/RockTalk/Wan2.2-VAE-MLX) uses cleaner keys and is recommended) |
| `config.json` | - | Distilled architecture config |
| `tokenizer.json`, `vocab.json`, `merges.txt` | - | Qwen2.5-VL tokenizer, verbatim |
| `samples/*.png` | - | Verified T2I outputs from this checkpoint |

## Usage

Requires `mlx >= 0.29`, `mlx-vlm >= 0.3`, `numpy`, `einops`, `transformers`, `pillow`, and the [`lance-mlx`](https://github.com/RockTalk/Lance-MLX) companion repo for the `Lance` Python class.

```bash
pip install mlx mlx-vlm numpy einops transformers pillow
```

```python
import mlx.core as mx
from lance_mlx.lance import Lance, LanceConfig
from lance_mlx.vae_wan22 import Wan2_2_VAE

# Build + strict-load (see tools/lance_t2i.py in the companion repo for the
# full builder; LanceConfig takes a Qwen2.5-VL ModelConfig built from
# config.json). `lance_cfg` is that LanceConfig and is not constructed here.
model = Lance(lance_cfg)
model.load_weights(list(mx.load("model.safetensors").items()), strict=True)

vae = Wan2_2_VAE(z_dim=48, c_dim=160, dim_mult=(1, 2, 4, 4),
                 temperal_downsample=(False, True, True))
vae.model.load_weights(list(mx.load("vae.safetensors").items()), strict=True)

# Sample (`text_ids` is the tokenized prompt, produced by the tokenizer)
latent = model.sample_t2i(
    prompt_token_ids=text_ids,   # (P,) int32 from tokenizer (no specials)
    latent_shape=(1, 32, 32),    # (T_lat, H_lat, W_lat) for 512×512 image
    special_token_ids={"bos": 151644, "eos": 151645,
                       "start_of_image": 151652, "end_of_image": 151653,
                       "image_token_id": 151655},
    num_steps=30, timestep_shift=3.5, cfg_scale=4.0, seed=0,
)
img = vae.decode(latent)         # (1, 1, 512, 512, 3) in [-1, 1]
```

End-to-end script: `tools/lance_t2i.py` in the [companion repo](https://github.com/RockTalk/Lance-MLX).
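
For readers unfamiliar with the sampler arguments, here is a minimal sketch of what `num_steps`, `timestep_shift`, and `cfg_scale` usually mean in a flow-matching sampler. It assumes Euler integration from t = 1 (noise) to t = 0 (data), the widely used shift t' = s·t / (1 + (s − 1)·t), and standard classifier-free guidance. It is not taken from `sample_t2i`, whose internals may differ, and `velocity_fn` is a hypothetical stand-in for the model.

```python
from typing import Callable, List, Optional

Vector = List[float]
# velocity_fn(x, t, prompt) -> predicted velocity; prompt=None means unconditional.
VelocityFn = Callable[[Vector, float, Optional[str]], Vector]


def shifted_timesteps(num_steps: int, shift: float) -> List[float]:
    """Return num_steps + 1 times from 1.0 down to 0.0.

    With shift > 1 the warp spends more of the steps at high noise levels.
    """
    ts = [1.0 - i / num_steps for i in range(num_steps + 1)]
    return [shift * t / (1.0 + (shift - 1.0) * t) for t in ts]


def cfg_combine(v_uncond: Vector, v_cond: Vector, scale: float) -> Vector:
    """Classifier-free guidance: extrapolate from unconditional toward conditional."""
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]


def sample(velocity_fn: VelocityFn, noise: Vector, prompt: str,
           num_steps: int = 30, shift: float = 3.5,
           cfg_scale: float = 4.0) -> Vector:
    """Euler-integrate the guided velocity field from noise (t=1) to data (t=0)."""
    x = list(noise)
    ts = shifted_timesteps(num_steps, shift)
    for t, t_next in zip(ts[:-1], ts[1:]):
        v_cond = velocity_fn(x, t, prompt)   # conditional forward pass
        v_uncond = velocity_fn(x, t, None)   # unconditional forward pass
        v = cfg_combine(v_uncond, v_cond, cfg_scale)
        # t_next < t, so this step moves x from noise toward data.
        x = [xi + (t_next - t) * vi for xi, vi in zip(x, v)]
    return x
```

The two `velocity_fn` calls per step correspond to the "one conditional + one unconditional forward per step" noted in the performance table.
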

## How the MoE-gen routing is implemented in MLX

Lance's checkpoint contains *two* sets of weights per Qwen2 block:

```
self_attn.{q,k,v,o}_proj          self_attn.{q,k,v,o}_proj_moe_gen
self_attn.{q,k}_norm              self_attn.{q,k}_norm_moe_gen
mlp.{gate,down,up}_proj           mlp_moe_gen.{gate,down,up}_proj
input_layernorm                   input_layernorm_moe_gen
post_attention_layernorm          post_attention_layernorm_moe_gen
```
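
As a small illustration of this naming scheme, the sketch below partitions a list of parameter names into the default (text) expert and the generation expert by looking for the `_moe_gen` marker. The helper name and the `layers.0.` prefix in the example keys are assumptions for illustration; real checkpoint keys may carry a different prefix.

```python
def split_by_expert(keys):
    """Partition parameter names into (text_expert, gen_expert) lists."""
    text, gen = [], []
    for key in keys:
        # Every generation-expert weight carries the `_moe_gen` marker.
        (gen if "_moe_gen" in key else text).append(key)
    return text, gen


# Hypothetical example keys following the pattern listed above.
example_keys = [
    "layers.0.self_attn.q_proj.weight",
    "layers.0.self_attn.q_proj_moe_gen.weight",
    "layers.0.mlp.gate_proj.weight",
    "layers.0.mlp_moe_gen.gate_proj.weight",
    "layers.0.input_layernorm.weight",
    "layers.0.input_layernorm_moe_gen.weight",
]
text_keys, gen_keys = split_by_expert(example_keys)
```
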

For T2I/T2V the sequence layout is:

```
<|im_start|> [prompt tokens] <|im_end|> <|vision_start|> [N latent placeholders] <|vision_end|>
                                                   └──── routed through moe_gen ────┘
                    ↑ everything else: normal weights
```

The MLX port (`qwen2_navit_mlx.py`) routes by slicing the sequence into the latent slab and the surrounding text, applying the appropriate expert to each slab, and concatenating the results. mrope position ids continue to flow normally across both slabs; the T/H/W axis coordinates vary only inside the latent slab.
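
The slab-based routing described above can be sketched in a few lines. This illustrates the idea only and is not the code in `qwen2_navit_mlx.py`: tokens are plain Python values, `text_expert` and `gen_expert` are hypothetical callables standing in for the two weight sets, and a single contiguous latent slab covering only the latent placeholders is assumed.

```python
def route_by_slab(tokens, latent_start, latent_end, text_expert, gen_expert):
    """Apply gen_expert to tokens[latent_start:latent_end], text_expert elsewhere.

    Each expert maps a sequence of tokens to a list of the same length.
    """
    if not 0 <= latent_start <= latent_end <= len(tokens):
        raise ValueError("latent slab out of range")
    prefix = tokens[:latent_start]           # text before the latents
    latent = tokens[latent_start:latent_end]  # latent placeholders
    suffix = tokens[latent_end:]             # text after the latents
    out = []
    for slab, expert in ((prefix, text_expert),
                         (latent, gen_expert),
                         (suffix, text_expert)):
        if slab:  # skip empty slabs
            out.extend(expert(slab))
    return out


# Toy usage: tag each token with the expert that processed it.
seq = ["<|im_start|>", "a", "cat", "<|im_end|>",
       "<|vision_start|>", "z0", "z1", "z2", "<|vision_end|>"]
routed = route_by_slab(
    seq, 5, 8,
    text_expert=lambda xs: [f"text({x})" for x in xs],
    gen_expert=lambda xs: [f"gen({x})" for x in xs],
)
```

Slicing into at most three slabs keeps each expert call a dense operation over a contiguous range, which avoids per-token gather and scatter.
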

## Conversion source

Converted from `bytedance-research/Lance/Lance_3B/*` using the open-source pipeline at https://github.com/RockTalk/Lance-MLX (`tools/convert_weights.py`). Layout transforms:

- Conv weights: PT `(O, I, [T,] H, W)` → MLX `(O, [T,] H, W, I)`
- Embedding weights: shape preserved
- `lm_head.weight` tied to `embed_tokens.weight` (Qwen default)
- All `*_moe_gen.*` keys copied verbatim under the same names
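
The convolution layout change amounts to moving the input-channel axis to the end. Below is a minimal sketch that works on shapes only, so it runs without any array library; with NumPy, the same permutation would be passed to `numpy.transpose`. The function names are hypothetical and are not those used in `tools/convert_weights.py`.

```python
def pt_to_mlx_conv_axes(ndim):
    """Axis permutation taking PT (O, I, *spatial) to MLX (O, *spatial, I)."""
    if ndim < 3:
        raise ValueError("expected a conv weight with at least one spatial axis")
    return (0, *range(2, ndim), 1)


def permute_shape(shape, axes):
    """Shape that results from transposing an array of `shape` by `axes`."""
    return tuple(shape[a] for a in axes)


# 2D conv: (O, I, H, W) -> (O, H, W, I)
shape_2d = permute_shape((8, 3, 5, 5), pt_to_mlx_conv_axes(4))
# 3D conv: (O, I, T, H, W) -> (O, T, H, W, I)
shape_3d = permute_shape((8, 3, 2, 5, 5), pt_to_mlx_conv_axes(5))
```
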

## License

Apache 2.0, inherited from upstream `bytedance-research/Lance`. The Wan 2.2 VAE component is also Apache 2.0, from Alibaba's Wan team.

## Acknowledgements

- **ByteDance Research**: original Lance training + PT release
- **Qwen team**: Qwen2.5-VL-3B-Instruct backbone
- **Alibaba Wan team**: Wan 2.2 VAE training
- **Apple `mlx` and `mlx-vlm` teams**: the underlying frameworks
- **This MLX port**: RockTalk

## Citation

```bibtex
@misc{lance_mlx,
  title  = {Lance-3B-MLX --- First MLX port of ByteDance's Lance},
  author = {RockTalk},
  year   = {2026},
  url    = {https://huggingface.co/RockTalk/Lance-3B-MLX}
}
```
config.json
CHANGED
@@ -67,14 +67,15 @@
  },
  "latent_patch_size": [
    1,
  ],
  "max_latent_size":
  "max_num_frames":
  "latent_channel": 48,
  "vae_downsample_spatial": 16,
  "vae_downsample_temporal": 4,
  "connector_act": "gelu_pytorch_tanh",
  "timestep_shift": 3.5
}

  },
  "latent_patch_size": [
    1,
    1,
    1
  ],
  "max_latent_size": 64,
  "max_num_frames": 120,
  "latent_channel": 48,
  "vae_downsample_spatial": 16,
  "vae_downsample_temporal": 4,
  "connector_act": "gelu_pytorch_tanh",
  "timestep_shift": 3.5,
  "max_num_latent_frames": 31
}