RockTalk committed on
Commit fd60d09 · verified · 1 Parent(s): a7c0c8e

Initial release: Lance-3B-Video-MLX weights; T2I working, T2V pending VAE streaming cache

Files changed (2)
  1. README.md +129 -55
  2. config.json +6 -5
README.md CHANGED
@@ -3,97 +3,171 @@ license: apache-2.0
  base_model:
  - bytedance-research/Lance
  - Qwen/Qwen2.5-VL-3B-Instruct
- pipeline_tag: any-to-any
  library_name: mlx
  tags:
  - multimodal
  - mlx
  - apple-silicon
  - image-generation
  - video-generation
- - image-editing
- - video-understanding
- - any-to-any
  - port
  ---
 
  # Lance-3B-Video-MLX

- A native [MLX](https://github.com/ml-explore/mlx) port of [ByteDance's Lance](https://huggingface.co/bytedance-research/Lance), a 3B-parameter unified multimodal model for image and video generation, editing, and understanding.

- Built on top of the Qwen2.5-VL-3B-Instruct backbone, with Lance's custom multi-task adapters and a Wan 2.2 VAE.

- ## Status

- **Weight conversion is complete**: all tensors from the upstream PyTorch
- checkpoint are present in MLX safetensors layout, verified bit-exact on
- sampled tensors. **Inference wrapper is not yet runnable.**
-
- | Component | Status |
  |---|---|
- | Weight conversion (PT → MLX safetensors, layout + name remaps) | ✅ DONE, bit-exact spot check |
- | `modeling_utils` (TimestepEmbedder, PositionEmbedding3D, MLP, sincos tables) | ✅ DONE |
- | `vae_wan22`: image-mode encode/decode | ✅ DONE |
- | `vae_wan22`: video streaming feat-cache | ⏳ PENDING |
- | `lance.py` adapters | ⚠ PARTIAL: primitives present; patchify path needs source-faithful rewrite |
- | Lance MoE-gen attention (`_moe_gen` weights bundled, 505 tensors) | ⏳ NOT YET WRAPPED |
- | Lance QK-norm extension (`q_norm`/`k_norm` weights bundled, 73 tensors) | ⏳ NOT YET WRAPPED |
- | Flow-matching sampler / CFG | ⏳ STUB |
- | X→T autoregressive / NaViT | ⏳ Phase 2 |

  ## Files

- - `model.safetensors`: Lance 3B **video variant** LLM + adapters (Qwen2.5-VL language model, vae2llm/llm2vae, time_embedder, latent_pos_embed), MLX-layout, **26.4 GB** (1411 tensors, ~7.1B params)
- - `vit.safetensors`: Qwen2.5-VL ViT visual encoder, MLX-layout (NTHWC conv weights), **1.25 GB** (390 tensors, ~668M params, fp16; bundled here for offline use)
- - `vae.safetensors`: Wan 2.2 VAE, MLX-layout, **2.62 GB** (196 tensors, ~705M params). Converted from the upstream `Wan2.2_VAE.pth` pickle.
- - `config.json`: distilled architecture config + embedded Qwen2.5-VL sub-config
- - `vit_config.json`: Qwen2.5-VL ViT sub-config
- - `tokenizer.json`, `tokenizer_config.json`, `vocab.json`, `merges.txt`, `generation_config.json`: copied verbatim from upstream

- ## Hardware

- Targets Apple Silicon with unified memory. Verified on M3 Ultra (512 GB). Lower-RAM Macs may need to run the LLM forward only (no joint backbone + VAE).

- ## Loading
 
  ```python
  import mlx.core as mx
- import mlx.nn as nn
  from lance_mlx.lance import Lance, LanceConfig
- from mlx_vlm.models.qwen2_5_vl.config import ModelConfig as Qwen25VLConfig
- import json

- from lance_mlx import load_lance

- model, cfg = load_lance("./")
- # cfg['_loaded_into_model'] reports how many tensors landed in the wrapper.
- # cfg['_vae_weights'] holds the Wan VAE keys (load into a separate vae module).
- # cfg['_moe_gen_weights'] holds Lance's generation-path weights, parked until
- # a MoE-aware wrapper is available.
  ```

- ## Citation

- ```bibtex
- @article{lance2026,
-   title = {Lance: Unified Multimodal Modeling by Multi-Task Synergy},
-   author = {Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and Jiang, Yunsheng and Huo, Yufei and Guo, Jianzhu and others},
-   journal = {arXiv preprint arXiv:2605.18678},
-   year = {2026},
-   url = {http://arxiv.org/abs/2605.18678}
- }
  ```

  ## License

- Apache-2.0, inherited from upstream `bytedance-research/Lance`.

- ## Acknowledgments

- - ByteDance Research for the original Lance training and PyTorch release
- - The `mlx` and `mlx-vlm` teams at Apple
- - Qwen team for Qwen2.5-VL-3B-Instruct

- ---

- **Port status reporting honestly:** this repo currently provides MLX-format weights with verified-loading scaffolding. Inference sampling (T2I/T2V) is a follow-up release; the building blocks are in place but the diffusion loop has not been parity-validated end-to-end yet. Pull requests welcome.
 
  base_model:
  - bytedance-research/Lance
  - Qwen/Qwen2.5-VL-3B-Instruct
+ pipeline_tag: text-to-video
  library_name: mlx
  tags:
  - multimodal
  - mlx
  - apple-silicon
+ - text-to-image
  - image-generation
  - video-generation
+ - diffusion
+ - flow-matching
+ - moe
+ - qwen2_5_vl
+ - wan
  - port
  ---
 
  # Lance-3B-Video-MLX

+ Video variant of [Lance-3B-MLX](https://huggingface.co/RockTalk/Lance-3B-MLX). First native [MLX](https://github.com/ml-explore/mlx) port of [ByteDance Research's Lance](https://huggingface.co/bytedance-research/Lance), a 3B-parameter unified multimodal model for image/video generation, editing, and understanding. Runs natively on Apple Silicon, no CUDA required.

+ The architecture is **Qwen2.5-VL-3B + parallel MoE-gen experts + Wan 2.2 VAE**. Lance uses "Mixture-of-Tokens" routing: every attention block and MLP has a parallel `*_moe_gen` branch. Text tokens go through the normal weights; VAE-latent (generation) tokens go through the `_moe_gen` weights, in the same forward pass.

+ ## What works

+ | Capability | Status |
+ |---|---|
+ | Text-to-image (T2I), single image, CFG | ✅ Working, verified |
+ | Strict load of all 1021 LLM/adapter tensors | ✅ Working |
+ | Wan 2.2 VAE encode/decode (T=1) | ✅ Working (uses [RockTalk/Wan2.2-VAE-MLX](https://huggingface.co/RockTalk/Wan2.2-VAE-MLX)) |
+ | Flow-matching denoising loop | ✅ Working |
+ | Classifier-free guidance | ✅ Working |
+ | 3D mrope position embeddings | ✅ Working |
+ | MoE-gen routing (per-token attention + MLP + layernorm) | ✅ Working |
+ | Text-to-video (T2V) | ⚠ Weights ready (31 latent frames × 64² positional grid); needs Wan 2.2 VAE T>1 streaming cache to materialize video frames end-to-end |
+ | Image/video editing (TI2I, TIV2V) | ⏳ Phase 2, needs ViT integration |
+ | X→T (image/video understanding) | ⏳ Phase 2, needs AR sampling loop + KV cache |
+
+ ## Sample generations
+
+ Verified on M4 Studio (128 GB). 30 steps, CFG=4, 512×512:
+
+ | Prompt | Output |
  |---|---|
+ | *"a photo of a sunset over mountains"* | ![sunset](samples/sunset_mountains.png) |
+ | *"a fluffy orange cat sitting on a wooden chair, photorealistic"* | ![cat](samples/orange_cat_chair.png) |
+ | *"a majestic snowy mountain peak with a dramatic blue sky and clouds"* | ![mountain](samples/snowy_peak.png) |
+
+ ## Performance
+
+ Measured on M4 Studio (128 GB) at CFG=4 (one conditional + one unconditional forward per step):
+
+ | Resolution | Steps | Per-step | Total sample | VAE decode |
+ |---|---|---|---|---|
+ | 256×256 | 24 | ~400 ms | ~9.6 s | ~0.1 s |
+ | 512×512 | 30 | ~1.2 s | ~36 s | ~0.5 s |
+
+ First-call kernel-compile penalty: a few seconds per new resolution.
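
The two forwards per step mentioned above come from how classifier-free guidance is usually applied. The following is a minimal NumPy sketch, not the port's actual sampler: it assumes the standard CFG blend, plain Euler integration from t=1 (noise) to t=0 (data), and the common `shift * t / (1 + (shift - 1) * t)` timestep warp. Whether `sample_t2i` does exactly this is not stated in the README, and `velocity_fn` / `toy_velocity` are hypothetical stand-ins for the model forward.

```python
import numpy as np

def shifted_timesteps(num_steps, shift):
    """Return num_steps + 1 times from 1.0 (pure noise) down to 0.0 (data)."""
    t = np.linspace(1.0, 0.0, num_steps + 1)
    return shift * t / (1.0 + (shift - 1.0) * t)

def cfg_velocity(v_cond, v_uncond, cfg_scale):
    """Standard classifier-free guidance blend of two predictions."""
    return v_uncond + cfg_scale * (v_cond - v_uncond)

def euler_sample(velocity_fn, x, num_steps=30, shift=3.5, cfg_scale=4.0):
    """Integrate dx/dt = v(x, t) with Euler steps; two forwards per step."""
    ts = shifted_timesteps(num_steps, shift)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v_cond = velocity_fn(x, t_cur, conditional=True)
        v_uncond = velocity_fn(x, t_cur, conditional=False)
        x = x + (t_next - t_cur) * cfg_velocity(v_cond, v_uncond, cfg_scale)
    return x

# Toy check: on a straight noise-to-target path the exact velocity
# is (x - target) / t, so the loop should land on the target.
target = np.full((4,), 0.25)

def toy_velocity(x, t, conditional):
    # Hypothetical stand-in for the model; ignores the conditioning flag.
    return (x - target) / t

noise = np.random.default_rng(0).standard_normal(4)
sample = euler_sample(toy_velocity, noise)
```

Under these assumptions a 30-step run costs 60 model forwards, which is consistent with the per-step timings in the table being reported for a conditional plus an unconditional pass.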
+
+ ## Differences vs Lance-3B-MLX
+
+ This is the same architecture as the image variant, with two differences:
+ - `model.safetensors`: 26.5 GB (vs 23 GB), extra weights for multi-frame attention
+ - `latent_pos_embed.pos_embed`: 31 × 64 × 64 = 126,976 positions (vs 1 × 64 × 64 = 4,096), which supports up to 31 latent frames (≈ 121 video frames @ 4× temporal downsample)
+
+ T2I via this checkpoint works the same as Lance-3B-MLX. T2V will work once the Wan 2.2 VAE temporal streaming cache is implemented (v0.1.0 of [RockTalk/Wan2.2-VAE-MLX](https://huggingface.co/RockTalk/Wan2.2-VAE-MLX)).
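
The position and frame counts quoted above can be reproduced with a few lines of arithmetic. This sketch assumes a causal video VAE in which the first frame is encoded on its own and every further latent frame covers `temporal_downsample` video frames; that convention is an assumption, and the helper names are illustrative only.

```python
def num_positions(latent_frames: int, latent_size: int) -> int:
    """Entries in a (T, H, W) positional grid with H == W == latent_size."""
    return latent_frames * latent_size * latent_size

def video_frames(latent_frames: int, temporal_downsample: int = 4) -> int:
    """Video frames covered by latent_frames, assuming a causal VAE
    (first frame alone, then temporal_downsample frames per latent)."""
    return (latent_frames - 1) * temporal_downsample + 1

print(num_positions(31, 64))  # 126976 (video variant)
print(num_positions(1, 64))   # 4096 (image variant)
print(video_frames(31))       # 121
```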
 
  ## Files

+ | File | Size | Description |
+ |---|---|---|
+ | `model.safetensors` | 26.5 GB | LLM (Qwen2.5-VL with MoE-gen) + Lance adapters, 1021 tensors |
+ | `vit.safetensors` | 1.25 GB | Qwen2.5-VL ViT (for understanding mode, Phase 2) |
+ | `vae.safetensors` | 2.62 GB | Wan 2.2 VAE (older keying, kept for compatibility; the standalone [RockTalk/Wan2.2-VAE-MLX](https://huggingface.co/RockTalk/Wan2.2-VAE-MLX) uses cleaner keys and is recommended) |
+ | `config.json` | - | Distilled architecture config |
+ | `tokenizer.json`, `vocab.json`, `merges.txt` | - | Qwen2.5-VL tokenizer, verbatim |
+ | `samples/*.png` | - | Verified T2I outputs from this checkpoint |

+ ## Usage

+ Requires `mlx >= 0.29`, `mlx-vlm >= 0.3`, `numpy`, `einops`, `transformers`, `pillow`, and the [`lance-mlx`](https://github.com/RockTalk/Lance-MLX) companion repo for the `Lance` Python class.

+ ```bash
+ pip install mlx mlx-vlm numpy einops transformers pillow
+ ```

  ```python
  import mlx.core as mx
  from lance_mlx.lance import Lance, LanceConfig
+ from lance_mlx.vae_wan22 import Wan2_2_VAE
+
+ # Build + strict-load (see tools/lance_t2i.py in the companion repo for the
+ # full builder; LanceConfig takes a Qwen2.5-VL ModelConfig built from
+ # config.json).
+ model = Lance(lance_cfg)
+ model.load_weights(list(mx.load("model.safetensors").items()), strict=True)
+
+ vae = Wan2_2_VAE(z_dim=48, c_dim=160, dim_mult=(1, 2, 4, 4),
+                  temperal_downsample=(False, True, True))
+ vae.model.load_weights(list(mx.load("vae.safetensors").items()), strict=True)
+
+ # Sample
+ latent = model.sample_t2i(
+     prompt_token_ids=text_ids,   # (P,) int32 from tokenizer (no specials)
+     latent_shape=(1, 32, 32),    # (T_lat, H_lat, W_lat) for 512×512 image
+     special_token_ids={"bos": 151644, "eos": 151645,
+                        "start_of_image": 151652, "end_of_image": 151653,
+                        "image_token_id": 151655},
+     num_steps=30, timestep_shift=3.5, cfg_scale=4.0, seed=0,
+ )
+ img = vae.decode(latent)  # (1, 1, 512, 512, 3) in [-1, 1]
+ ```
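
Two small steps around the snippet above are left implicit: picking `latent_shape` for a given resolution, and turning the decoded `[-1, 1]` tensor into displayable pixels. The sketch below is one way to do both in NumPy. It assumes the 16× spatial and 4× temporal factors from `config.json`, the causal "first frame alone" temporal convention, and that the decoded output has already been converted to a NumPy array; `latent_shape_for` and `to_uint8_frames` are hypothetical helper names, not part of `lance_mlx`.

```python
import numpy as np

def latent_shape_for(height, width, num_frames=1, spatial=16, temporal=4):
    """Return (T_lat, H_lat, W_lat) for a pixel-space request."""
    if height % spatial or width % spatial:
        raise ValueError(f"height and width must be multiples of {spatial}")
    t_lat = (num_frames - 1) // temporal + 1  # assumed causal-VAE convention
    return (t_lat, height // spatial, width // spatial)

def to_uint8_frames(decoded):
    """Map floats in [-1, 1] to uint8 in [0, 255], keeping the array shape."""
    arr = np.asarray(decoded, dtype=np.float32)
    arr = np.clip((arr + 1.0) * 127.5, 0.0, 255.0)
    return np.round(arr).astype(np.uint8)

print(latent_shape_for(512, 512))  # (1, 32, 32), as in the usage snippet
fake_decoded = np.zeros((1, 1, 512, 512, 3), dtype=np.float32)  # mid-grey
frames = to_uint8_frames(fake_decoded)
```

With Pillow installed, `PIL.Image.fromarray(frames[0, 0])` would then give a saveable image for the first (and only) frame.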
 
+ End-to-end script: `tools/lance_t2i.py` in the [companion repo](https://github.com/RockTalk/Lance-MLX).
 
+ ## How the MoE-gen routing is implemented in MLX
+
+ Lance's checkpoint contains *two* sets of weights per Qwen2 block:
+
+ ```
+ self_attn.{q,k,v,o}_proj      self_attn.{q,k,v,o}_proj_moe_gen
+ self_attn.{q,k}_norm          self_attn.{q,k}_norm_moe_gen
+ mlp.{gate,down,up}_proj       mlp_moe_gen.{gate,down,up}_proj
+ input_layernorm               input_layernorm_moe_gen
+ post_attention_layernorm      post_attention_layernorm_moe_gen
  ```

+ For T2I/T2V the sequence layout is:

  ```
+ <|im_start|> [prompt tokens] <|im_end|> <|vision_start|> [N latent placeholders] <|vision_end|>
+                                                     └──── routed through moe_gen ────┘
+              ↑ everything else: normal weights
+ ```
+
+ The MLX port (`qwen2_navit_mlx.py`) routes by slicing the sequence into the latent slab and the surrounding text, applying the appropriate expert to each slab, and concatenating the results. The mrope position ids continue to flow normally across both slabs, with the T/H/W axis coordinates varying only inside the latent slab.
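
The slice, apply, concatenate pattern described above can be illustrated in a few lines. This is a NumPy sketch under simplifying assumptions, not the code in `qwen2_navit_mlx.py`: it handles a single contiguous latent slab, treats each expert as a plain per-token function, and uses made-up names (`route_by_slab`, `w_text`, `w_gen`). It applies only to per-token operations such as projections, MLPs, and norms; the attention mixing itself still runs over the full sequence.

```python
import numpy as np

def route_by_slab(x, latent_start, latent_end, text_fn, gen_fn):
    """x: (seq, dim). Tokens in [latent_start, latent_end) go through gen_fn,
    all other tokens go through text_fn; order is preserved."""
    head = text_fn(x[:latent_start])
    slab = gen_fn(x[latent_start:latent_end])
    tail = text_fn(x[latent_end:])
    return np.concatenate([head, slab, tail], axis=0)

rng = np.random.default_rng(0)
dim = 8
w_text = rng.standard_normal((dim, dim))  # stands in for e.g. q_proj
w_gen = rng.standard_normal((dim, dim))   # stands in for q_proj_moe_gen
x = rng.standard_normal((10, dim))        # 10 tokens, tokens 4..8 are latents

routed = route_by_slab(x, 4, 9, lambda h: h @ w_text, lambda h: h @ w_gen)

# Reference: compute both experts everywhere, then select per token.
is_latent = np.zeros(10, dtype=bool)
is_latent[4:9] = True
reference = np.where(is_latent[:, None], x @ w_gen, x @ w_text)
```

Slicing avoids running both experts on every token, which is what the masked reference does; the two agree numerically, as the comparison against `reference` shows.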
+
+ ## Conversion source
+
+ Converted from `bytedance-research/Lance/Lance_3B/*` using the open-source pipeline at https://github.com/RockTalk/Lance-MLX (`tools/convert_weights.py`). Layout transforms:
+
+ - Conv weights: PT `(O, I, [T,] H, W)` → MLX `(O, [T,] H, W, I)`
+ - Embedding weights: shape preserved
+ - `lm_head.weight` tied to `embed_tokens.weight` (Qwen default)
+ - All `*_moe_gen.*` keys copied verbatim under the same names
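
The conv-weight transform in the list above amounts to moving the input-channel axis to the end. A minimal NumPy sketch of that one step follows; the function name `pt_conv_to_mlx` is hypothetical, and the real `tools/convert_weights.py` may handle this differently (for example with key-based detection of which tensors are conv weights).

```python
import numpy as np

def pt_conv_to_mlx(weight):
    """(O, I, H, W) -> (O, H, W, I), or (O, I, T, H, W) -> (O, T, H, W, I)."""
    if weight.ndim not in (4, 5):
        raise ValueError(f"expected a 4-D or 5-D conv weight, got {weight.ndim}-D")
    # moveaxis returns a view; make it contiguous before saving to disk.
    return np.ascontiguousarray(np.moveaxis(weight, 1, -1))

w2d = np.arange(8 * 3 * 5 * 5).reshape(8, 3, 5, 5)         # (O, I, H, W)
w3d = np.arange(8 * 3 * 2 * 5 * 5).reshape(8, 3, 2, 5, 5)  # (O, I, T, H, W)
m2d = pt_conv_to_mlx(w2d)
m3d = pt_conv_to_mlx(w3d)
print(m2d.shape, m3d.shape)  # (8, 5, 5, 3) (8, 2, 5, 5, 3)
```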
 
  ## License

+ Apache 2.0, inherited from upstream `bytedance-research/Lance`. The Wan 2.2 VAE component is also Apache 2.0 from Alibaba's Wan team.

+ ## Acknowledgements

+ - **ByteDance Research**: original Lance training + PT release
+ - **Qwen team**: Qwen2.5-VL-3B-Instruct backbone
+ - **Alibaba Wan team**: Wan 2.2 VAE training
+ - **Apple `mlx` and `mlx-vlm` teams**: the underlying frameworks
+ - **This MLX port**: RockTalk

+ ## Citation

+ ```bibtex
+ @misc{lance_mlx,
+   title = {Lance-3B-MLX: First MLX port of ByteDance's Lance},
+   author = {RockTalk},
+   year = {2026},
+   url = {https://huggingface.co/RockTalk/Lance-3B-MLX}
+ }
+ ```
config.json CHANGED
@@ -67,14 +67,15 @@
    },
    "latent_patch_size": [
      1,
-     2,
-     2
    ],
-   "max_latent_size": 32,
-   "max_num_frames": 25,
    "latent_channel": 48,
    "vae_downsample_spatial": 16,
    "vae_downsample_temporal": 4,
    "connector_act": "gelu_pytorch_tanh",
-   "timestep_shift": 3.5
  }
 
    },
    "latent_patch_size": [
      1,
+     1,
+     1
    ],
+   "max_latent_size": 64,
+   "max_num_frames": 120,
    "latent_channel": 48,
    "vae_downsample_spatial": 16,
    "vae_downsample_temporal": 4,
    "connector_act": "gelu_pytorch_tanh",
+   "timestep_shift": 3.5,
+   "max_num_latent_frames": 31
  }