---
license: apache-2.0
language:
- en
- zh
library_name: mlx
pipeline_tag: image-to-image
tags:
- mlx
- apple-silicon
- lance
- bytedance
- multimodal
- text-to-image
- image-editing
- vqa
- qwen2.5-vl
- quantized
- 8-bit
base_model: bytedance-research/Lance
---

> ⚠️ **SUPERSEDED: DO NOT USE.** This 8-bit checkpoint produces visibly degraded t2i
> output (ghost subject + rainbow striped artifacts vs bf16). Kept on Hugging Face for
> historical reproducibility of the May 2026 quantization research record only.
>
> **What to use instead:**
> - For full-quality `t2i` / `image_edit` / `x2t_image`: [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16) (~15 GB)
> - For compressed `x2t_image` (VQA) on 8-16 GB Macs: [`mlx-community/Lance-3B-AWQ-INT4`](https://huggingface.co/mlx-community/Lance-3B-AWQ-INT4) (5.65 GB repo, 3.31 GB LLM, 6-9× faster decode)
> - For image generation on small RAM: **no quantized variant is shippable**; use bf16 on a Mac that fits it. Phase 5c-3h showed the 80% high-frequency (HF) detail loss is architectural (forward-pass error compounding through Lance's 2,160 evaluations per image), not a quant-scheme problem.

> 🎓 **Quantization research closed (2026-05-26).** The May 2026 effort
> investigated naive groupwise 4/8-bit, DWQ (4-bit UND-only), and AWQ
> (4-bit + 8-bit, full + UND-only) across multiple configurations. AWQ math
> is correct per-Linear (Phase 5c-3h empirical confirmation: -28% average
> output MSE at 8-bit), but per-step quant improvements don't compound through
> Lance's flow-matching architecture. No quant scheme tested would close
> the t2i gap; k-quants from llama.cpp would face the same compounding
> problem. **Lance-3B-AWQ-INT4 is the final shipping outcome (VQA only).**
> Full research record: [`xocialize/lance-mlx`](https://github.com/xocialize/lance-mlx)
> under `notes/phase5n_diagnostics/phase5c3_awq_port/`.
---

> 📂 Part of the **[Lance MLX collection](https://huggingface.co/collections/mlx-community/lance-mlx-6a0f3cd5648a74f8283fc8a4)** on mlx-community.

# Lance-3B-8bit (MLX, image specialist, 8-bit quantized)

8-bit groupwise affine quantization of [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16), the image-specialist Lance checkpoint. Produced via mlx-lm's `quantize_model` utility with a per-tower skip predicate: `time_embedder`, `llm2vae`, and `vae_in_proj` are kept at bf16 for numerical safety, while the bulk LLM weights (attention projections, MLP, embeddings, lm_head) are quantized.

## Status

🟢 **Production-ready for image tasks on Apple Silicon as of 2026-05-21.**

| Capability | Status | Speedup vs bf16 |
|---|---|---|
| t2i (text → image) | ✅ Photorealistic, prompt-aligned | **~2.7× faster** (75 s vs 201 s for 768² × 30 steps × CFG=4.0) |
| image_edit (instruction-based) | ✅ Identity + style preservation | ~2.5× faster expected |
| x2t_image (image VQA) | ✅ Content-correct | similar / faster |

**Memory footprint:** 6.59 GB on disk (53% of the bf16 12.37 GB). Runtime RAM ~8–10 GB, comfortable on a 16 GB Mac.

## Quality notes vs bf16

- **Photorealism + content fidelity preserved.** Cats, dragons, portraits, etc., all generate cleanly.
- **Fine text on generated objects shows slight degradation.** For example, "STOP" on a sign may render as "SNICS" or a similar near-miss. The content is otherwise correct (correct color, correct rectangular sign shape, recognizable text-like glyphs).
- For prompts that don't require legible in-image text, output is visually indistinguishable from bf16 to a casual eye.
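### Footprint arithmetic (sketch)

The 53% figure quoted under Status can be roughly reproduced from the recipe alone. The sketch below assumes each group of 64 weights stores one scale and one bias at 16 bits each next to the 8-bit codes. That storage layout is an assumption, not something read out of `quantization_report.json`, and the helper name `bits_per_weight` is hypothetical.

```python
# Rough estimate only. Assumes one 16-bit scale and one 16-bit bias per
# group of weights; the actual layout is not confirmed by this card.
def bits_per_weight(bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Effective storage per weight: code bits plus per-group scale and bias."""
    return bits + 2 * meta_bits / group_size


bf16_gb = 12.37  # bf16 LLM size quoted in this card
ratio = bits_per_weight(8, 64) / 16  # bf16 stores 16 bits per weight
estimate_gb = bf16_gb * ratio

print(f"{bits_per_weight(8, 64):.2f} bits/weight, ratio {ratio:.3f}, ~{estimate_gb:.2f} GB")
```

Under these assumptions the estimate is 8.5 bits per weight, or about 6.57 GB, slightly below the reported 6.59 GB. The gap is plausibly the tensors kept at bf16, though this card does not break that down.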
## Quickstart

```python
from huggingface_hub import snapshot_download

# Download the repo and get the local snapshot directory
weights = snapshot_download("mlx-community/Lance-3B-8bit")
```

### Text-to-image

```python
from lance_mlx.pipeline.t2i import TextToImagePipeline

# Build the pipeline from the downloaded weights and the bundled VAE
pipe = TextToImagePipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)

# 768 x 768, 30 steps, CFG 4.0 (the setting benchmarked under Status)
image = pipe.generate(
    "A photorealistic tabby cat in a sunlit window.",
    height=768,
    width=768,
    num_steps=30,
    cfg_scale=4.0,
    seed=42,
)
image.save("cat.png")
```

### Image editing + VQA

Same API as the bf16 variant: `ImageEditPipeline` and `UnderstandingPipeline` both pick up the `quantization` block in `config.json` automatically via `lance_mlx.model._loader.load_lance_model`.

## What's quantized vs skipped

| Component | Quantization | Why |
|---|---|---|
| `embed_tokens` (151,936 × 2,048) | ✅ 8-bit | Big, tolerant |
| `lm_head` (151,936 × 2,048) | ✅ 8-bit | Big, used in AR decode only |
| 32 layers × `q/k/v/o_proj` (UND) | ✅ 8-bit | Bulk of LLM compute |
| 32 layers × `q/k/v/o_proj_moe_gen` (GEN) | ✅ 8-bit | Bulk of GEN compute |
| 32 layers × `mlp.{up,gate,down}_proj` | ✅ 8-bit | Bulk of LLM compute |
| 32 layers × `mlp_moe_gen.{up,gate,down}` | ✅ 8-bit | Bulk of GEN compute |
| `time_embedder.proj_in/out` | ❌ bf16 | Timestep info, numerically sensitive |
| `llm2vae` (flow head, 2048 × 48) | ❌ bf16 | Tiny + critical to flow prediction |
| `vae_in_proj.vae2llm` (2048 × 48) | ❌ bf16 | Auto-skipped (input_dim 48 ≠ 64*k) |
| `latent_pos_embed.pos_embed` | ❌ bf16 | Custom param holder, no `to_quantized` |
| All RMSNorms + QK-norms | ❌ bf16 | F32 / bf16 norm scales preserved |
| Wan2.2 VAE (encoder + decoder) | ❌ bf16 | Pixel fidelity matters |
| Qwen2.5-VL ViT | ❌ bf16 | Semantic fidelity matters for x2t |

Recipe: 8-bit affine, group_size 64. `quantization_report.json` in this repo has full provenance.
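The table and recipe above combine two ideas: a rule for which Linear layers are quantized, and the groupwise affine scheme applied to the ones that are. The following pure-Python sketch illustrates both under stated assumptions. It is not the mlx-lm `quantize_model` code path or the lance-mlx loader. The names `should_quantize`, `quantize_group`, and `dequantize_group`, the `(path, input_dim)` signature, and the example module path `layers.0.self_attn.q_proj` are all hypothetical. The min/max affine mapping is the textbook form and may differ in detail from what MLX does internally.

```python
# Hypothetical sketch: names, paths, and signatures are illustrative only.
SKIP_COMPONENTS = ("time_embedder", "llm2vae", "vae_in_proj")  # kept at bf16 per the table
GROUP_SIZE = 64
BITS = 8


def should_quantize(path, input_dim, group_size=GROUP_SIZE):
    """Decide whether the Linear layer at `path` gets quantized."""
    # Explicit skips: numerically sensitive components stay at bf16.
    if any(part in SKIP_COMPONENTS for part in path.split(".")):
        return False
    # Automatic skip: the input dimension must split into whole groups.
    return input_dim % group_size == 0


def quantize_group(values, bits=BITS):
    """Affine-quantize one group so that value ~= code * scale + bias."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1  # 255 steps between the 256 codes of 8-bit
    scale = (hi - lo) / levels if hi > lo else 1.0  # flat group: avoid dividing by zero
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate values from integer codes."""
    return [c * scale + bias for c in codes]


# Usage: one 64-element group of evenly spaced weights in [-0.5, 0.5].
group = [i / 63 - 0.5 for i in range(GROUP_SIZE)]
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print(should_quantize("layers.0.self_attn.q_proj", 2048))  # hypothetical path
print(should_quantize("vae_in_proj.vae2llm", 48))
print(max_err <= scale / 2 + 1e-12)
```

In this scheme the reconstruction error for any weight is bounded by half a quantization step, which is why a group with a wide value range (large `scale`) loses more precision than a narrow one.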
## Why no Video 8-bit yet

The video specialist (`Lance_3B_Video`) does **not** quantize cleanly to 8-bit with this recipe: t2v output collapses to a gray gradient regardless of whether the GEN tower is included or skipped, and finer group_sizes don't help. The video-specialist fine-tune has different weight distributions that affine 8-bit can't capture.

Reza2kn/lance-quant's findings suggest **DWQ (distilled weight quantization)** with calibration is the right approach for Lance video at 8-bit and below. That's a Phase 5c project. For now, use [`mlx-community/Lance-3B-Video-bf16`](https://huggingface.co/mlx-community/Lance-3B-Video-bf16) at bf16 for video tasks.

## Files in this repo

| File | Size | Notes |
|---|---|---|
| `model.safetensors` | 6.59 GB | Quantized LLM weights (2033 tensors: each Linear becomes weight + scales + biases) |
| `vit.safetensors` | 1.34 GB | bf16 (not quantized) |
| `vae.safetensors` | 1.41 GB | bf16 (not quantized) |
| `config.json` | – | With `quantization` block (`bits=8, group_size=64, mode=affine`) |
| `quantization_report.json` | – | Provenance + footprint stats |
| `tokenizer.json` / `vocab.json` | – | Qwen2.5-VL vocabulary |

## Architecture (same as the bf16 variant)

See [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16) for the full architecture description.

## License

This MLX port + quantization: **Apache 2.0**. Underlying weights:

- Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
- Wan2.2 VAE: Apache 2.0 (Alibaba).
- Qwen2.5-VL: Apache 2.0 (Alibaba).
## Citation

```bibtex
@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}
```

## Links

- **MLX port code:** [`github.com/xocialize/lance-mlx`](https://github.com/xocialize/lance-mlx)
- **bf16 source:** [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16)
- **Standalone VAE:** [`mlx-community/Wan2.2-VAE-Lance-bf16`](https://huggingface.co/mlx-community/Wan2.2-VAE-Lance-bf16)
- **Video specialist (bf16, alpha 8-bit pending):** [`mlx-community/Lance-3B-Video-bf16`](https://huggingface.co/mlx-community/Lance-3B-Video-bf16)