xocialize committed · Commit 1d5963d · verified · 1 Parent(s): bba2b1b

Upload: Lance_3B 8-bit affine quant (retry)

README.md ADDED
@@ -0,0 +1,140 @@
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ - zh
6
+ library_name: mlx
7
+ pipeline_tag: image-to-image
8
+ tags:
9
+ - mlx
10
+ - apple-silicon
11
+ - lance
12
+ - bytedance
13
+ - multimodal
14
+ - text-to-image
15
+ - image-editing
16
+ - vqa
17
+ - qwen2.5-vl
18
+ - quantized
19
+ - 8-bit
20
+ base_model: bytedance-research/Lance
21
+ ---
22
+
23
+ > 📂 Part of the **[Lance MLX collection](https://huggingface.co/collections/mlx-community/lance-mlx-6a0f3cd5648a74f8283fc8a4)** on mlx-community.
24
+
25
+ # Lance-3B-8bit (MLX, image specialist, 8-bit quantized)
26
+
27
+ 8-bit groupwise affine quantization of [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16), the image-specialist Lance checkpoint. Produced via mlx-lm's `quantize_model` utility with a per-tower skip predicate: `time_embedder` and `llm2vae` are kept at bf16 for numerical safety, `vae_in_proj` stays at bf16 because its input dimension (48) is not a multiple of the group size, and the bulk LLM weights (attention projections, MLP, embeddings, lm_head) are quantized.
28
+
29
+ ## Status
30
+
31
+ 🟢 **Production-ready for image tasks on Apple Silicon as of 2026-05-21.**
32
+
33
+ | Capability | Status | Speedup vs bf16 |
34
+ |---|---|---|
35
+ | t2i (text → image) | ✅ Photorealistic, prompt-aligned | **~2.7× faster** (75 s vs 201 s for 768² × 30 steps × CFG=4.0) |
36
+ | image_edit (instruction-based) | ✅ Identity + style preservation | ~2.5× faster expected |
37
+ | x2t_image (image VQA) | ✅ Content-correct | similar / faster |
38
+
39
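+ **Memory footprint:** `model.safetensors` is 6.59 GB on disk (53% of the 12.37 GB bf16 equivalent); the unquantized `vit.safetensors` (1.34 GB) and `vae.safetensors` (1.41 GB) bring the repo to about 9.3 GB. Runtime RAM ~8–10 GB, comfortable on a 16 GB Mac.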
40
+
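A rough check of the 53% figure can be sketched as follows. It assumes that each group of 64 weights stores one scale and one bias in the source dtype (bf16, 16 bits each) next to the packed 8-bit values; that storage layout is an assumption for illustration and is not stated in the report. The helper `bits_per_weight` is hypothetical.

```python
# Back-of-envelope size estimate for 8-bit groupwise affine quantization.
# Assumption: every group of `group_size` weights carries one bf16 scale and
# one bf16 bias (16 bits each) in addition to the packed 8-bit values.
def bits_per_weight(bits: int, group_size: int, meta_bits: int = 16) -> float:
    return bits + 2 * meta_bits / group_size


effective = bits_per_weight(bits=8, group_size=64)  # 8 + 32/64 bits per weight
estimated_ratio = effective / 16  # size relative to bf16 (16 bits per weight)

# Byte counts copied from quantization_report.json in this commit.
reported_ratio = 6585364576 / 12371046496

print(f"estimated {estimated_ratio:.4f}, reported {reported_ratio:.4f}")
```

The reported ratio comes out slightly above the estimate, which is consistent with a few tensors being kept at bf16.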
41
+ ## Quality notes vs bf16
42
+
43
+ - **Photorealism + content fidelity preserved.** Cats, dragons, portraits, etc., all generate cleanly.
44
+ - **Fine text on generated objects shows slight degradation.** E.g. "STOP" on a sign may render as "SNICS" or similar near-miss. The content is otherwise correct (correct color, correct rectangular sign shape, recognizable text-like glyphs).
45
+ - For prompts that don't require legible in-image text, output is visually indistinguishable from bf16 to a casual eye.
46
+
47
+ ## Quickstart
48
+
49
+ ```python
50
+ from huggingface_hub import snapshot_download
51
+ weights = snapshot_download("mlx-community/Lance-3B-8bit")
52
+ ```
53
+
54
+ ### Text-to-image
55
+
56
+ ```python
57
+ from lance_mlx.pipeline.t2i import TextToImagePipeline
58
+
59
+ pipe = TextToImagePipeline.from_pretrained(
60
+ lance_weights_dir=weights,
61
+ vae_safetensors=f"{weights}/vae.safetensors",
62
+ )
63
+ image = pipe.generate(
64
+ "A photorealistic tabby cat in a sunlit window.",
65
+ height=768, width=768, num_steps=30, cfg_scale=4.0, seed=42,
66
+ )
67
+ image.save("cat.png")
68
+ ```
69
+
70
+ ### Image editing + VQA
71
+
72
+ Same API as the bf16 variant — `ImageEditPipeline` and `UnderstandingPipeline` both pick up the `quantization` block in `config.json` automatically via `lance_mlx.model._loader.load_lance_model`.
73
+
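To see what the loader would find, the `quantization` block can be read straight from `config.json`. This is a minimal sketch with a hypothetical helper, `read_quantization`; it does not reproduce the internals of `load_lance_model`, which are not shown in this commit.

```python
import json
from pathlib import Path


def read_quantization(weights_dir):
    """Return the `quantization` block from config.json, or None if absent.

    Hypothetical helper for illustration only; it is not part of lance_mlx.
    """
    config = json.loads((Path(weights_dir) / "config.json").read_text())
    return config.get("quantization")
```

For the `config.json` added in this commit, the block is `{"group_size": 64, "bits": 8, "mode": "affine"}`; a bf16 checkpoint without the block would yield `None`.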
74
+ ## What's quantized vs skipped
75
+
76
+ | Component | Quantization | Why |
77
+ |---|---|---|
78
+ | `embed_tokens` (151,936 × 2,048) | ✅ 8-bit | Big, tolerant |
79
+ | `lm_head` (151,936 × 2,048) | ✅ 8-bit | Big, used in AR decode only |
80
+ | 36 layers × `q/k/v/o_proj` (UND) | ✅ 8-bit | Bulk of LLM compute |
81
+ | 36 layers × `q/k/v/o_proj_moe_gen` (GEN) | ✅ 8-bit | Bulk of GEN compute |
82
+ | 36 layers × `mlp.{up,gate,down}_proj` | ✅ 8-bit | Bulk of LLM compute |
83
+ | 36 layers × `mlp_moe_gen.{up,gate,down}` | ✅ 8-bit | Bulk of GEN compute |
84
+ | `time_embedder.proj_in/out` | ❌ bf16 | Timestep info, numerically sensitive |
85
+ | `llm2vae` (flow head, 2048 × 48) | ❌ bf16 | Tiny + critical to flow prediction |
86
+ | `vae_in_proj.vae2llm` (2048 × 48) | ❌ bf16 | Auto-skipped (input_dim 48 ≠ 64*k) |
87
+ | `latent_pos_embed.pos_embed` | ❌ bf16 | Custom param holder, no `to_quantized` |
88
+ | All RMSNorms + QK-norms | ❌ bf16 | F32 / bf16 norm scales preserved |
89
+ | Wan2.2 VAE (encoder + decoder) | ❌ bf16 | Pixel fidelity matters |
90
+ | Qwen2.5-VL ViT | ❌ bf16 | Semantic fidelity matters for x2t |
91
+
92
+ Recipe: 8-bit affine, group_size 64. `quantization_report.json` in this repo has full provenance.
93
+
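The recipe can be illustrated with a small pure-Python sketch of groupwise affine quantization, where each group of 64 values is stored as integer codes plus one scale and one bias, so that `value ≈ scale * code + bias`. The function names are hypothetical, and MLX's actual kernels may differ in how they choose the scale, round, and pack the codes.

```python
def quantize_group(values, bits=8):
    """Affine-quantize one group: value ~= scale * code + bias."""
    levels = (1 << bits) - 1  # 255 for 8-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate values from codes, scale, and bias."""
    return [scale * q + bias for q in codes]


def quantize_row(row, group_size=64, bits=8):
    """Split a row into consecutive groups and quantize each one independently."""
    assert len(row) % group_size == 0, "row length must be a multiple of group_size"
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]
```

In this sketch the reconstruction error per value is bounded by half a quantization step (`scale / 2`). A row of length 48 fails the length check, which mirrors why `vae_in_proj.vae2llm` is listed as auto-skipped in the table above.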
94
+ ## Why no Video 8-bit yet
95
+
96
+ The video specialist (`Lance_3B_Video`) does **not** quantize cleanly to 8-bit with this recipe — t2v output collapses to a gray gradient regardless of whether the GEN tower is included or skipped, and finer group_sizes don't help. The video-specialist fine-tune has different weight distributions that affine 8-bit can't capture.
97
+
98
+ Reza2kn/lance-quant's findings suggest **DWQ (distilled weight quantization)** with calibration is the right approach for Lance video at 8-bit and below. That's a Phase 5c project. For now, use [`mlx-community/Lance-3B-Video-bf16`](https://huggingface.co/mlx-community/Lance-3B-Video-bf16) at bf16 for video tasks.
99
+
100
+ ## Files in this repo
101
+
102
+ | File | Size | Notes |
103
+ |---|---|---|
104
+ | `model.safetensors` | 6.59 GB | Quantized LLM weights (2033 tensors: each Linear becomes weight + scales + biases) |
105
+ | `vit.safetensors` | 1.34 GB | bf16 (not quantized) |
106
+ | `vae.safetensors` | 1.41 GB | bf16 (not quantized) |
107
+ | `config.json` | – | With `quantization` block (`bits=8, group_size=64, mode=affine`) |
108
+ | `quantization_report.json` | – | Provenance + footprint stats |
109
+ | `tokenizer.json` / `vocab.json` | – | Qwen2.5-VL vocabulary |
110
+
111
+ ## Architecture (same as the bf16 variant)
112
+
113
+ See [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16) for the full architecture description.
114
+
115
+ ## License
116
+
117
+ This MLX port + quantization: **Apache 2.0**.
118
+
119
+ Underlying weights:
120
+ - Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
121
+ - Wan2.2 VAE: Apache 2.0 (Alibaba).
122
+ - Qwen2.5-VL: Apache 2.0 (Alibaba).
123
+
124
+ ## Citation
125
+
126
+ ```bibtex
127
+ @article{fu2026lance,
128
+ title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
129
+ author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
130
+ journal={arXiv preprint arXiv:2605.18678},
131
+ year={2026}
132
+ }
133
+ ```
134
+
135
+ ## Links
136
+
137
+ - **MLX port code:** [`github.com/xocialize/lance-mlx`](https://github.com/xocialize/lance-mlx)
138
+ - **bf16 source:** [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16)
139
+ - **Standalone VAE:** [`mlx-community/Wan2.2-VAE-Lance-bf16`](https://huggingface.co/mlx-community/Wan2.2-VAE-Lance-bf16)
140
+ - **Video specialist (bf16, alpha 8-bit pending):** [`mlx-community/Lance-3B-Video-bf16`](https://huggingface.co/mlx-community/Lance-3B-Video-bf16)
config.json ADDED
@@ -0,0 +1,71 @@
1
+ {
2
+ "architectures": [
3
+ "Qwen2_5_VLForConditionalGeneration"
4
+ ],
5
+ "attention_dropout": 0.0,
6
+ "bos_token_id": 151643,
7
+ "eos_token_id": 151645,
8
+ "vision_start_token_id": 151652,
9
+ "vision_end_token_id": 151653,
10
+ "vision_token_id": 151654,
11
+ "image_token_id": 151655,
12
+ "video_token_id": 151656,
13
+ "hidden_act": "silu",
14
+ "hidden_size": 2048,
15
+ "initializer_range": 0.02,
16
+ "intermediate_size": 11008,
17
+ "max_position_embeddings": 128000,
18
+ "max_window_layers": 70,
19
+ "model_type": "qwen2_5_vl",
20
+ "num_attention_heads": 16,
21
+ "num_hidden_layers": 36,
22
+ "num_key_value_heads": 2,
23
+ "rms_norm_eps": 1e-06,
24
+ "rope_theta": 1000000.0,
25
+ "sliding_window": 32768,
26
+ "tie_word_embeddings": false,
27
+ "torch_dtype": "bfloat16",
28
+ "transformers_version": "4.41.2",
29
+ "use_cache": true,
30
+ "use_sliding_window": false,
31
+ "vision_config": {
32
+ "depth": 32,
33
+ "hidden_act": "silu",
34
+ "hidden_size": 1280,
35
+ "intermediate_size": 3420,
36
+ "num_heads": 16,
37
+ "in_chans": 3,
38
+ "out_hidden_size": 2048,
39
+ "patch_size": 14,
40
+ "spatial_merge_size": 2,
41
+ "spatial_patch_size": 14,
42
+ "window_size": 112,
43
+ "fullatt_block_indexes": [
44
+ 7,
45
+ 15,
46
+ 23,
47
+ 31
48
+ ],
49
+ "tokens_per_second": 2,
50
+ "temporal_patch_size": 2
51
+ },
52
+ "rope_scaling": {
53
+ "type": "mrope",
54
+ "mrope_section": [
55
+ 16,
56
+ 24,
57
+ 24
58
+ ]
59
+ },
60
+ "vocab_size": 151936,
61
+ "quantization": {
62
+ "group_size": 64,
63
+ "bits": 8,
64
+ "mode": "affine"
65
+ },
66
+ "quantization_config": {
67
+ "group_size": 64,
68
+ "bits": 8,
69
+ "mode": "affine"
70
+ }
71
+ }
generation_config.json ADDED
@@ -0,0 +1,12 @@
1
+ {
2
+ "bos_token_id": 151643,
3
+ "pad_token_id": 151643,
4
+ "do_sample": true,
5
+ "eos_token_id": [
6
+ 151645,
7
+ 151643
8
+ ],
9
+ "repetition_penalty": 1.05,
10
+ "temperature": 0.000001,
11
+ "transformers_version": "4.49.0"
12
+ }
llm_config.json ADDED
@@ -0,0 +1,61 @@
1
+ {
2
+ "architectures": [
3
+ "Qwen2_5_VLForConditionalGeneration"
4
+ ],
5
+ "attention_dropout": 0.0,
6
+ "bos_token_id": 151643,
7
+ "eos_token_id": 151645,
8
+ "vision_start_token_id": 151652,
9
+ "vision_end_token_id": 151653,
10
+ "vision_token_id": 151654,
11
+ "image_token_id": 151655,
12
+ "video_token_id": 151656,
13
+ "hidden_act": "silu",
14
+ "hidden_size": 2048,
15
+ "initializer_range": 0.02,
16
+ "intermediate_size": 11008,
17
+ "max_position_embeddings": 128000,
18
+ "max_window_layers": 70,
19
+ "model_type": "qwen2_5_vl",
20
+ "num_attention_heads": 16,
21
+ "num_hidden_layers": 36,
22
+ "num_key_value_heads": 2,
23
+ "rms_norm_eps": 1e-06,
24
+ "rope_theta": 1000000.0,
25
+ "sliding_window": 32768,
26
+ "tie_word_embeddings": true,
27
+ "torch_dtype": "bfloat16",
28
+ "transformers_version": "4.41.2",
29
+ "use_cache": true,
30
+ "use_sliding_window": false,
31
+ "vision_config": {
32
+ "depth": 32,
33
+ "hidden_act": "silu",
34
+ "hidden_size": 1280,
35
+ "intermediate_size": 3420,
36
+ "num_heads": 16,
37
+ "in_chans": 3,
38
+ "out_hidden_size": 2048,
39
+ "patch_size": 14,
40
+ "spatial_merge_size": 2,
41
+ "spatial_patch_size": 14,
42
+ "window_size": 112,
43
+ "fullatt_block_indexes": [
44
+ 7,
45
+ 15,
46
+ 23,
47
+ 31
48
+ ],
49
+ "tokens_per_second": 2,
50
+ "temporal_patch_size": 2
51
+ },
52
+ "rope_scaling": {
53
+ "type": "mrope",
54
+ "mrope_section": [
55
+ 16,
56
+ 24,
57
+ 24
58
+ ]
59
+ },
60
+ "vocab_size": 151936
61
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3cf99c5fb64a6663b5cc04eea73e78ea8920a1e16fcecabc509a3da335d3c072
3
+ size 6585590531
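The three lines above form a Git LFS pointer, not the weights themselves. A minimal sketch of reading such a pointer and converting its size to decimal gigabytes, using a hypothetical helper `parse_lfs_pointer`:

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    fields["size"] = int(fields["size"])  # size is given in bytes
    return fields


# Pointer contents copied from the model.safetensors entry in this commit.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:3cf99c5fb64a6663b5cc04eea73e78ea8920a1e16fcecabc509a3da335d3c072
size 6585590531
"""

info = parse_lfs_pointer(pointer)
print(f"{info['size'] / 1e9:.2f} GB")  # prints "6.59 GB"
```

The result matches the 6.59 GB quoted for `model.safetensors` in the README, which therefore uses decimal gigabytes.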
quantization_report.json ADDED
@@ -0,0 +1,16 @@
1
+ {
2
+ "source_dir": "/Volumes/DEV_VOL1/VideoResearch/lance-mlx-models/Lance-3B-bf16",
3
+ "bits": 8,
4
+ "group_size": 64,
5
+ "mode": "affine",
6
+ "bf16_bytes": 12371046496,
7
+ "quantized_bytes": 6585364576,
8
+ "compression_ratio": 0.5323207360128573,
9
+ "n_tensors_bf16": 1021,
10
+ "n_tensors_quant": 2033,
11
+ "skip_patterns": [
12
+ "time_embedder.proj_in",
13
+ "time_embedder.proj_out",
14
+ "llm2vae"
15
+ ]
16
+ }
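The two tensor counts in this report can be cross-checked against the README's component table. The sketch below assumes that each quantized module stores `weight`, `scales`, and `biases` in place of a single `weight`, that the model has the 36 decoder layers given by `num_hidden_layers` in `config.json`, and that each layer has four attention projections and three MLP projections in each of the two towers (UND and GEN). The module inventory is inferred from the README, not read from the checkpoint.

```python
# Counts copied from quantization_report.json.
n_tensors_bf16 = 1021
n_tensors_quant = 2033

# Each quantized module gains two tensors (scales and biases).
quantized_modules = (n_tensors_quant - n_tensors_bf16) // 2

# Assumed inventory: 36 layers x 2 towers x (4 attention + 3 MLP projections),
# plus embed_tokens and lm_head.
num_layers = 36
expected_modules = num_layers * 2 * (4 + 3) + 2

print(quantized_modules, expected_modules)
```

Both values come out to 506, so the counts in the report agree with a 36-layer model under these assumptions.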
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
vae.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:707e20bb83bdffff77774e04275d64b5ee8660f98390ce362538078d020b6807
3
+ size 1409401642
vit.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4abfe7f4b7a22d2119a11ff678f6dbc8ff310d6a10f4a0e019ce87ae3c1721ee
3
+ size 1337407631
vocab.json ADDED
The diff for this file is too large to render. See raw diff