Upload: Lance_3B 8-bit affine quant (retry)

Files changed (10)

README.md +140 -0
config.json +71 -0
generation_config.json +12 -0
llm_config.json +61 -0
model.safetensors +3 -0
quantization_report.json +16 -0
tokenizer.json +0 -0
vae.safetensors +3 -0
vit.safetensors +3 -0
vocab.json +0 -0

README.md ADDED

	@@ -0,0 +1,140 @@

+---
+license: apache-2.0
+language:
+- en
+- zh
+library_name: mlx
+pipeline_tag: image-to-image
+tags:
+- mlx
+- apple-silicon
+- lance
+- bytedance
+- multimodal
+- text-to-image
+- image-editing
+- vqa
+- qwen2.5-vl
+- quantized
+- 8-bit
+base_model: bytedance-research/Lance
+---
+> 📂 Part of the **[Lance MLX collection](https://huggingface.co/collections/mlx-community/lance-mlx-6a0f3cd5648a74f8283fc8a4)** on mlx-community.
+# Lance-3B-8bit (MLX, image specialist, 8-bit quantized)
+8-bit groupwise affine quantization of [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16), the image-specialist Lance checkpoint. Produced with mlx-lm's `quantize_model` utility and a per-tower skip predicate: `time_embedder` and `llm2vae` are explicitly kept at bf16 for numerical safety, `vae_in_proj` stays at bf16 because its input dimension (48) is not a multiple of the group size, and the bulk LLM weights (attention projections, MLP, embeddings, `lm_head`) are quantized.
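To make "8-bit groupwise affine" concrete, here is a minimal pure-Python sketch of the idea: each group of 64 weights gets its own scale and bias, and every weight is stored as an integer code in 0..255. The helper names are hypothetical, and MLX's actual kernels pack codes and derive scale/bias in their own way, so treat this as an illustration rather than the library's implementation.

```python
def quantize_group(values, bits=8):
    """Affine-quantize one group of floats to unsigned integer codes."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1               # 255 for 8-bit
    scale = (hi - lo) / levels or 1.0      # fall back to 1.0 for a constant group
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo                # lo plays the role of the per-group bias


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats: w ~= code * scale + bias."""
    return [c * scale + bias for c in codes]


def quantize_row(row, group_size=64, bits=8):
    """Split one weight row into groups and quantize each independently."""
    assert len(row) % group_size == 0, "row length must be a multiple of group_size"
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]
```

With this scheme the reconstruction error of any weight is bounded by half of its group's scale, which is why a smaller group size or a narrower value range gives a tighter approximation.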
+## Status
+🟢 **Production-ready for image tasks on Apple Silicon as of 2026-05-21.**
+| Capability | Status | Speedup vs bf16 |
+|---|---|---|
+| t2i (text → image) | ✅ Photorealistic, prompt-aligned | **~2.7× faster** (75 s vs 201 s for 768² × 30 steps × CFG=4.0) |
+| image_edit (instruction-based) | ✅ Identity + style preservation | ~2.5× faster expected |
+| x2t_image (image VQA) | ✅ Content-correct | similar / faster |
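+**Memory footprint:** the quantized LLM weights (`model.safetensors`) are 6.59 GB on disk, 53% of the 12.37 GB bf16 original. The bf16 ViT (1.34 GB) and VAE (1.41 GB) bring the full download to about 9.3 GB. Runtime RAM ~8–10 GB, comfortable on a 16 GB Mac.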
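The 53% figure can be sanity-checked against the byte counts recorded in `quantization_report.json`. The sketch below also compares it with an idealized ratio, under the assumption (not stated in the report) that every group of 64 weights stores one 16-bit scale and one 16-bit bias alongside its 8-bit codes.

```python
# Byte counts copied from quantization_report.json in this commit.
bf16_bytes = 12_371_046_496
quantized_bytes = 6_585_364_576

measured_ratio = quantized_bytes / bf16_bytes

# Assumption: per group of 64 weights, one 16-bit scale and one 16-bit bias.
group_size = 64
bits_per_weight = 8 + (16 + 16) / group_size   # 8.5 effective bits
ideal_ratio = bits_per_weight / 16             # relative to 16-bit bf16

print(f"measured: {measured_ratio:.4f}, idealized: {ideal_ratio:.4f}")
```

The measured ratio sits slightly above the idealized one, which is consistent with a small set of tensors (norms, `time_embedder`, `llm2vae`, and similar) staying at bf16.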
+## Quality notes vs bf16
+- **Photorealism + content fidelity preserved.** Cats, dragons, portraits, etc., all generate cleanly.
+- **Fine text on generated objects shows slight degradation.** E.g. "STOP" on a sign may render as "SNICS" or similar near-miss. The content is otherwise correct (correct color, correct rectangular sign shape, recognizable text-like glyphs).
+- For prompts that don't require legible in-image text, output is visually indistinguishable from bf16 to a casual eye.
+## Quickstart
+```python
+from huggingface_hub import snapshot_download
+weights = snapshot_download("mlx-community/Lance-3B-8bit")
+```
+### Text-to-image
+```python
+from lance_mlx.pipeline.t2i import TextToImagePipeline
+pipe = TextToImagePipeline.from_pretrained(
+    lance_weights_dir=weights,
+    vae_safetensors=f"{weights}/vae.safetensors",
+)
+image = pipe.generate(
+    "A photorealistic tabby cat in a sunlit window.",
+    height=768, width=768, num_steps=30, cfg_scale=4.0, seed=42,
+)
+image.save("cat.png")
+```
+### Image editing + VQA
+Same API as the bf16 variant — `ImageEditPipeline` and `UnderstandingPipeline` both pick up the `quantization` block in `config.json` automatically via `lance_mlx.model._loader.load_lance_model`.
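As a rough illustration of what "picking up the `quantization` block" involves, the sketch below reads `config.json` from a weights directory and reports the recipe. The function names are hypothetical and this is not the code of `load_lance_model`; it only shows the kind of check a loader could perform before deciding whether to build quantized layers.

```python
import json
from pathlib import Path


def read_quantization(weights_dir):
    """Return the `quantization` block from config.json, or None if absent."""
    config = json.loads((Path(weights_dir) / "config.json").read_text())
    return config.get("quantization")


def describe(quant):
    """Human-readable summary of a quantization block."""
    if quant is None:
        return "unquantized (bf16)"
    return f"{quant['bits']}-bit {quant['mode']}, group_size={quant['group_size']}"
```

For the `config.json` in this commit, `describe(read_quantization(weights))` would report an 8-bit affine recipe with group size 64.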
+## What's quantized vs skipped
+| Component | Quantization | Why |
+|---|---|---|
+| `embed_tokens` (151,936 × 2,048) | ✅ 8-bit | Big, tolerant |
+| `lm_head` (151,936 × 2,048) | ✅ 8-bit | Big, used in AR decode only |
+| 36 layers × `q/k/v/o_proj` (UND) | ✅ 8-bit | Bulk of LLM compute |
+| 36 layers × `q/k/v/o_proj_moe_gen` (GEN) | ✅ 8-bit | Bulk of GEN compute |
+| 36 layers × `mlp.{up,gate,down}_proj` | ✅ 8-bit | Bulk of LLM compute |
+| 36 layers × `mlp_moe_gen.{up,gate,down}` | ✅ 8-bit | Bulk of GEN compute |
+| `time_embedder.proj_in/out` | ❌ bf16 | Timestep info, numerically sensitive |
+| `llm2vae` (flow head, 2048 × 48) | ❌ bf16 | Tiny + critical to flow prediction |
+| `vae_in_proj.vae2llm` (2048 × 48) | ❌ bf16 | Auto-skipped (input_dim 48 ≠ 64*k) |
+| `latent_pos_embed.pos_embed` | ❌ bf16 | Custom param holder, no `to_quantized` |
+| All RMSNorms + QK-norms | ❌ bf16 | F32 / bf16 norm scales preserved |
+| Wan2.2 VAE (encoder + decoder) | ❌ bf16 | Pixel fidelity matters |
+| Qwen2.5-VL ViT | ❌ bf16 | Semantic fidelity matters for x2t |
+Recipe: 8-bit affine, group_size 64. `quantization_report.json` in this repo has full provenance.
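The table above boils down to a small decision rule. The following sketch expresses it as a predicate, using the `skip_patterns` from `quantization_report.json` and the group-size divisibility condition from the `vae_in_proj` row. The function name, the module paths in the usage note, and the `has_to_quantized` flag are illustrative assumptions, not the port's actual code.

```python
# Patterns copied from quantization_report.json; everything else is illustrative.
SKIP_PATTERNS = ("time_embedder.proj_in", "time_embedder.proj_out", "llm2vae")
GROUP_SIZE = 64


def should_quantize(path, input_dim, has_to_quantized=True):
    """Decide whether a module at `path` would be quantized under this recipe."""
    if not has_to_quantized:
        return False                      # e.g. custom parameter holders
    if any(pattern in path for pattern in SKIP_PATTERNS):
        return False                      # explicitly kept at bf16
    return input_dim % GROUP_SIZE == 0    # auto-skip when groups do not divide evenly
```

For example, a hypothetical path such as `layers.0.self_attn.q_proj` with input dimension 2048 passes, while `vae_in_proj.vae2llm` with input dimension 48 is rejected by the divisibility check alone.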
+## Why no Video 8-bit yet
+The video specialist (`Lance_3B_Video`) does **not** quantize cleanly to 8-bit with this recipe — t2v output collapses to a gray gradient regardless of whether the GEN tower is included or skipped, and finer group_sizes don't help. The video-specialist fine-tune has different weight distributions that affine 8-bit can't capture.
+Findings from Reza2kn/lance-quant suggest that **DWQ (distilled weight quantization)** with calibration is the right approach for Lance video at 8-bit and below. That's a Phase 5c project. For now, use [`mlx-community/Lance-3B-Video-bf16`](https://huggingface.co/mlx-community/Lance-3B-Video-bf16) at bf16 for video tasks.
+## Files in this repo
+| File | Size | Notes |
+|---|---|---|
+| `model.safetensors` | 6.59 GB | Quantized LLM weights (2033 tensors: each Linear becomes weight + scales + biases) |
+| `vit.safetensors` | 1.34 GB | bf16 (not quantized) |
+| `vae.safetensors` | 1.41 GB | bf16 (not quantized) |
+| `config.json` | – | With `quantization` block (`bits=8, group_size=64, mode=affine`) |
+| `quantization_report.json` | – | Provenance + footprint stats |
+| `tokenizer.json` / `vocab.json` | – | Qwen2.5-VL vocabulary |
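The tensor count in `model.safetensors` can be cross-checked with simple arithmetic. This sketch assumes each quantized module adds exactly two tensors (scales and biases) next to its weight, and takes the layer count from `num_hidden_layers` in `config.json`; the bf16 and quantized tensor totals come from `quantization_report.json`.

```python
num_layers = 36                    # num_hidden_layers in config.json
# Per layer: q/k/v/o for UND, q/k/v/o for GEN, up/gate/down for UND and for GEN.
modules_per_layer = 4 + 4 + 3 + 3
quantized_modules = num_layers * modules_per_layer + 2   # plus embed_tokens and lm_head

n_tensors_bf16 = 1021              # from quantization_report.json
extra_per_module = 2               # scales + biases added per quantized module
n_tensors_quant = n_tensors_bf16 + extra_per_module * quantized_modules

print(quantized_modules, n_tensors_quant)
```

The result agrees with the 2033 tensors recorded in the report.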
+## Architecture (same as the bf16 variant)
+See [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16) for the full architecture description.
+## License
+This MLX port + quantization: **Apache 2.0**.
+Underlying weights:
+- Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
+- Wan2.2 VAE: Apache 2.0 (Alibaba).
+- Qwen2.5-VL: Apache 2.0 (Alibaba).
+## Citation
+```bibtex
+@article{fu2026lance,
+  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
+  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
+  journal={arXiv preprint arXiv:2605.18678},
+  year={2026}
+}
+```
+## Links
+- **MLX port code:** [`github.com/xocialize/lance-mlx`](https://github.com/xocialize/lance-mlx)
+- **bf16 source:** [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16)
+- **Standalone VAE:** [`mlx-community/Wan2.2-VAE-Lance-bf16`](https://huggingface.co/mlx-community/Wan2.2-VAE-Lance-bf16)
+- **Video specialist (bf16; 8-bit alpha pending):** [`mlx-community/Lance-3B-Video-bf16`](https://huggingface.co/mlx-community/Lance-3B-Video-bf16)

config.json ADDED

	@@ -0,0 +1,71 @@

+{
+  "architectures": [
+    "Qwen2_5_VLForConditionalGeneration"
+  ],
+  "attention_dropout": 0.0,
+  "bos_token_id": 151643,
+  "eos_token_id": 151645,
+  "vision_start_token_id": 151652,
+  "vision_end_token_id": 151653,
+  "vision_token_id": 151654,
+  "image_token_id": 151655,
+  "video_token_id": 151656,
+  "hidden_act": "silu",
+  "hidden_size": 2048,
+  "initializer_range": 0.02,
+  "intermediate_size": 11008,
+  "max_position_embeddings": 128000,
+  "max_window_layers": 70,
+  "model_type": "qwen2_5_vl",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 36,
+  "num_key_value_heads": 2,
+  "rms_norm_eps": 1e-06,
+  "rope_theta": 1000000.0,
+  "sliding_window": 32768,
+  "tie_word_embeddings": false,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.41.2",
+  "use_cache": true,
+  "use_sliding_window": false,
+  "vision_config": {
+    "depth": 32,
+    "hidden_act": "silu",
+    "hidden_size": 1280,
+    "intermediate_size": 3420,
+    "num_heads": 16,
+    "in_chans": 3,
+    "out_hidden_size": 2048,
+    "patch_size": 14,
+    "spatial_merge_size": 2,
+    "spatial_patch_size": 14,
+    "window_size": 112,
+    "fullatt_block_indexes": [
+      7,
+      15,
+      23,
+      31
+    ],
+    "tokens_per_second": 2,
+    "temporal_patch_size": 2
+  },
+  "rope_scaling": {
+    "type": "mrope",
+    "mrope_section": [
+      16,
+      24,
+      24
+    ]
+  },
+  "vocab_size": 151936,
+  "quantization": {
+    "group_size": 64,
+    "bits": 8,
+    "mode": "affine"
+  },
+  "quantization_config": {
+    "group_size": 64,
+    "bits": 8,
+    "mode": "affine"
+  }
+}

generation_config.json ADDED

	@@ -0,0 +1,12 @@

+{
+  "bos_token_id": 151643,
+  "pad_token_id": 151643,
+  "do_sample": true,
+  "eos_token_id": [
+    151645,
+    151643
+  ],
+  "repetition_penalty": 1.05,
+  "temperature": 0.000001,
+  "transformers_version": "4.49.0"
+}

llm_config.json ADDED

	@@ -0,0 +1,61 @@

+{
+  "architectures": [
+    "Qwen2_5_VLForConditionalGeneration"
+  ],
+  "attention_dropout": 0.0,
+  "bos_token_id": 151643,
+  "eos_token_id": 151645,
+  "vision_start_token_id": 151652,
+  "vision_end_token_id": 151653,
+  "vision_token_id": 151654,
+  "image_token_id": 151655,
+  "video_token_id": 151656,
+  "hidden_act": "silu",
+  "hidden_size": 2048,
+  "initializer_range": 0.02,
+  "intermediate_size": 11008,
+  "max_position_embeddings": 128000,
+  "max_window_layers": 70,
+  "model_type": "qwen2_5_vl",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 36,
+  "num_key_value_heads": 2,
+  "rms_norm_eps": 1e-06,
+  "rope_theta": 1000000.0,
+  "sliding_window": 32768,
+  "tie_word_embeddings": true,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.41.2",
+  "use_cache": true,
+  "use_sliding_window": false,
+  "vision_config": {
+    "depth": 32,
+    "hidden_act": "silu",
+    "hidden_size": 1280,
+    "intermediate_size": 3420,
+    "num_heads": 16,
+    "in_chans": 3,
+    "out_hidden_size": 2048,
+    "patch_size": 14,
+    "spatial_merge_size": 2,
+    "spatial_patch_size": 14,
+    "window_size": 112,
+    "fullatt_block_indexes": [
+      7,
+      15,
+      23,
+      31
+    ],
+    "tokens_per_second": 2,
+    "temporal_patch_size": 2
+  },
+  "rope_scaling": {
+    "type": "mrope",
+    "mrope_section": [
+      16,
+      24,
+      24
+    ]
+  },
+  "vocab_size": 151936
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3cf99c5fb64a6663b5cc04eea73e78ea8920a1e16fcecabc509a3da335d3c072
+size 6585590531

quantization_report.json ADDED

	@@ -0,0 +1,16 @@

+{
+  "source_dir": "/Volumes/DEV_VOL1/VideoResearch/lance-mlx-models/Lance-3B-bf16",
+  "bits": 8,
+  "group_size": 64,
+  "mode": "affine",
+  "bf16_bytes": 12371046496,
+  "quantized_bytes": 6585364576,
+  "compression_ratio": 0.5323207360128573,
+  "n_tensors_bf16": 1021,
+  "n_tensors_quant": 2033,
+  "skip_patterns": [
+    "time_embedder.proj_in",
+    "time_embedder.proj_out",
+    "llm2vae"
+  ]
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

vae.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:707e20bb83bdffff77774e04275d64b5ee8660f98390ce362538078d020b6807
+size 1409401642

vit.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4abfe7f4b7a22d2119a11ff678f6dbc8ff310d6a10f4a0e019ce87ae3c1721ee
+size 1337407631

vocab.json ADDED

The diff for this file is too large to render. See raw diff