---
license: apache-2.0
language:
- en
- zh
library_name: mlx
pipeline_tag: image-to-image
tags:
- mlx
- apple-silicon
- lance
- bytedance
- multimodal
- text-to-image
- image-editing
- vqa
- qwen2.5-vl
- quantized
- 8-bit
base_model: bytedance-research/Lance
---
> ⚠️ **SUPERSEDED: DO NOT USE.** This 8-bit checkpoint produces visibly degraded t2i
> output (ghost subject + rainbow striped artifacts vs bf16). Kept on Hugging Face for historical
> reproducibility of the May 2026 quantization research record only.
>
> **What to use instead:**
> - For full-quality `t2i` / `image_edit` / `x2t_image`: [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16) (~15 GB)
> - For compressed `x2t_image` (VQA) on 8-16 GB Macs: [`mlx-community/Lance-3B-AWQ-INT4`](https://huggingface.co/mlx-community/Lance-3B-AWQ-INT4) (5.65 GB repo, 3.31 GB LLM, 6-9× faster decode)
> - For image generation on small RAM: **no quantized variant is shippable**; use bf16 on a Mac that fits it. Phase 5c-3h showed the 80% high-frequency detail loss is architectural (forward-pass error compounding through Lance's 2,160 evaluations per image), not a quant-scheme problem.
>
> 🎓 **Quantization research closed (2026-05-26).** The May 2026 effort
> investigated naive groupwise 4/8-bit, DWQ (4-bit UND-only), and AWQ
> (4-bit + 8-bit, full + UND-only) across multiple configurations. AWQ math
> is correct per-Linear (Phase 5c-3h empirical confirmation: -28% output MSE
> average at 8-bit) but per-step quant improvements don't compound through
> Lance's flow-matching architecture. No quant scheme tested would close
> the t2i gap; k-quants from llama.cpp would face the same compounding
> problem. **Lance-3B-AWQ-INT4 is the final shipping outcome, for VQA only.**
> Full research record: [`xocialize/lance-mlx`](https://github.com/xocialize/lance-mlx)
> under `notes/phase5n_diagnostics/phase5c3_awq_port/`.
---
> 📂 Part of the **[Lance MLX collection](https://huggingface.co/collections/mlx-community/lance-mlx-6a0f3cd5648a74f8283fc8a4)** on mlx-community.
# Lance-3B-8bit (MLX, image specialist, 8-bit quantized)
8-bit groupwise affine quantization of [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16), the image-specialist Lance checkpoint. Produced via mlx-lm's `quantize_model` utility with a per-tower skip predicate: `time_embedder`, `llm2vae`, and `vae_in_proj` are kept at bf16 for numerical safety, while the bulk LLM weights (attention projections, MLP, embeddings, lm_head) are quantized.
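
The skip predicate described above can be sketched as a plain function. This is a minimal illustration, not the lance-mlx implementation: `should_quantize`, its signature, and the example module paths are hypothetical. Only the three skipped module names come from this README, and the divisibility check assumes the usual groupwise requirement that the input dimension splits into whole groups.

```python
# Hypothetical skip predicate. The module names mirror the ones listed in this
# README; the function name, signature, and example paths are illustrative only.
SKIP_SUBSTRINGS = ("time_embedder", "llm2vae", "vae_in_proj")


def should_quantize(path: str, input_dim: int, group_size: int = 64) -> bool:
    """Return True if the layer at `path` should be quantized."""
    # Keep numerically sensitive modules at bf16.
    if any(name in path for name in SKIP_SUBSTRINGS):
        return False
    # Groupwise quantization needs the input dimension to split into whole groups.
    return input_dim % group_size == 0


print(should_quantize("layers.0.self_attn.q_proj", 2048))  # True
print(should_quantize("time_embedder.proj_in", 2048))      # False (skipped by name)
print(should_quantize("some_proj", 48))                    # False (48 is not a multiple of 64)
```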
## Status
🟢 **Production-ready for image tasks on Apple Silicon as of 2026-05-21.**
| Capability | Status | Speedup vs bf16 |
|---|---|---|
| t2i (text → image) | ✅ Photorealistic, prompt-aligned | **~2.7× faster** (75 s vs 201 s for 768² × 30 steps × CFG=4.0) |
| image_edit (instruction-based) | ✅ Identity + style preservation | ~2.5× faster expected |
| x2t_image (image VQA) | ✅ Content-correct | similar / faster |
**Memory footprint:** 6.59 GB on disk (53% of the bf16 12.37 GB). Runtime RAM ~8–10 GB, comfortable on a 16 GB Mac.
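
As a rough sanity check on the 53% figure, the storage cost of 8-bit groupwise quantization can be worked out by hand. The sketch below assumes each group of 64 weights carries one 16-bit scale and one 16-bit bias; that storage width is an assumption, not something stated in this README. The helper name `effective_bits_per_weight` is hypothetical.

```python
def effective_bits_per_weight(bits: int, group_size: int,
                              scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Payload bits plus per-group scale/bias overhead, amortized per weight."""
    return bits + (scale_bits + bias_bits) / group_size


bpw = effective_bits_per_weight(bits=8, group_size=64)
ratio = bpw / 16  # relative to 16-bit bf16 weights
print(f"{bpw} bits/weight, {ratio:.1%} of bf16")  # 8.5 bits/weight, 53.1% of bf16
```

Under that assumption the estimate lines up with the reported 6.59 GB / 12.37 GB ratio; the tensors left at bf16 are small enough that they barely move the result.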
## Quality notes vs bf16
- **Photorealism + content fidelity preserved.** Cats, dragons, portraits, etc., all generate cleanly.
- **Fine text on generated objects shows slight degradation.** E.g. "STOP" on a sign may render as "SNICS" or similar near-miss. The content is otherwise correct (correct color, correct rectangular sign shape, recognizable text-like glyphs).
- For prompts that don't require legible in-image text, output is visually indistinguishable from bf16 to a casual eye.
## Quickstart
```python
from huggingface_hub import snapshot_download
weights = snapshot_download("mlx-community/Lance-3B-8bit")
```
### Text-to-image
```python
from lance_mlx.pipeline.t2i import TextToImagePipeline

# `weights` is the local snapshot directory returned in the Quickstart step.
pipe = TextToImagePipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
# Generate a 768x768 image with 30 sampling steps and a fixed seed.
image = pipe.generate(
    "A photorealistic tabby cat in a sunlit window.",
    height=768, width=768, num_steps=30, cfg_scale=4.0, seed=42,
)
image.save("cat.png")
```
### Image editing + VQA
Same API as the bf16 variant: `ImageEditPipeline` and `UnderstandingPipeline` both pick up the `quantization` block in `config.json` automatically via `lance_mlx.model._loader.load_lance_model`.
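
To confirm what a loader would see, the `quantization` block can be read directly from `config.json` with the standard library. This is a small inspection sketch, not part of lance-mlx: `read_quantization_block` is a hypothetical helper, and the key names (`bits`, `group_size`, `mode`) are taken from the file listing in this README.

```python
import json
from pathlib import Path
from typing import Optional


def read_quantization_block(weights_dir: str) -> Optional[dict]:
    """Return the `quantization` block from config.json, or None if it is absent."""
    config_path = Path(weights_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config.get("quantization")
```

Calling `read_quantization_block(weights)` on a snapshot of this repo should return a dict describing the 8-bit recipe, while a bf16 snapshot without the block would return `None`.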
## What's quantized vs skipped
| Component | Quantization | Why |
|---|---|---|
| `embed_tokens` (151,936 × 2,048) | ✅ 8-bit | Big, tolerant |
| `lm_head` (151,936 × 2,048) | ✅ 8-bit | Big, used in AR decode only |
| 32 layers × `q/k/v/o_proj` (UND) | ✅ 8-bit | Bulk of LLM compute |
| 32 layers × `q/k/v/o_proj_moe_gen` (GEN) | ✅ 8-bit | Bulk of GEN compute |
| 32 layers × `mlp.{up,gate,down}_proj` | ✅ 8-bit | Bulk of LLM compute |
| 32 layers × `mlp_moe_gen.{up,gate,down}` | ✅ 8-bit | Bulk of GEN compute |
| `time_embedder.proj_in/out` | ❌ bf16 | Timestep info, numerically sensitive |
| `llm2vae` (flow head, 2048 × 48) | ❌ bf16 | Tiny + critical to flow prediction |
| `vae_in_proj.vae2llm` (2048 × 48) | ❌ bf16 | Auto-skipped (input_dim 48 ≠ 64*k) |
| `latent_pos_embed.pos_embed` | ❌ bf16 | Custom param holder, no `to_quantized` |
| All RMSNorms + QK-norms | ❌ bf16 | F32 / bf16 norm scales preserved |
| Wan2.2 VAE (encoder + decoder) | ❌ bf16 | Pixel fidelity matters |
| Qwen2.5-VL ViT | ❌ bf16 | Semantic fidelity matters for x2t |
Recipe: 8-bit affine, group_size 64. `quantization_report.json` in this repo has full provenance.
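
For readers unfamiliar with the recipe, the idea behind groupwise affine quantization can be sketched in pure Python. This is an illustrative toy under stated assumptions (per-group minimum as the bias, evenly spaced levels, round-to-nearest); it is not MLX's kernel, whose packing and rounding details may differ. All function names are hypothetical.

```python
import math


def quantize_group(values, bits=8):
    """Affine-quantize one group so that w ~= scale * q + bias, q in [0, 2**bits - 1]."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    if scale == 0.0:  # constant group: any non-zero scale works
        scale = 1.0
    q = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return q, scale, lo  # the bias is the group minimum


def dequantize_group(q, scale, bias):
    """Map integer codes back to approximate weights."""
    return [scale * qi + bias for qi in q]


def quantize_row(row, group_size=64, bits=8):
    """Split a row into groups of `group_size` and quantize each independently."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


row = [math.sin(0.1 * i) for i in range(128)]  # toy weights, two groups of 64
groups = quantize_row(row)
restored = [w for q, s, b in groups for w in dequantize_group(q, s, b)]
max_err = max(abs(a - r) for a, r in zip(row, restored))
print(f"groups={len(groups)}, max abs error={max_err:.2e}")
```

With round-to-nearest, the per-weight reconstruction error in each group is bounded by half that group's scale, which is why smaller groups or more bits tighten the approximation.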
## Why no Video 8-bit yet
The video specialist (`Lance_3B_Video`) does **not** quantize cleanly to 8-bit with this recipe: t2v output collapses to a gray gradient regardless of whether the GEN tower is included or skipped, and finer group_sizes don't help. The video-specialist fine-tune has different weight distributions that affine 8-bit can't capture.
Reza2kn/lance-quant's findings suggest **DWQ (distilled weight quantization)** with calibration is the right approach for Lance video at 8-bit and below. That's a Phase 5c project. For now, use [`mlx-community/Lance-3B-Video-bf16`](https://huggingface.co/mlx-community/Lance-3B-Video-bf16) for video tasks.
## Files in this repo
| File | Size | Notes |
|---|---|---|
| `model.safetensors` | 6.59 GB | Quantized LLM weights (2033 tensors: each Linear becomes weight + scales + biases) |
| `vit.safetensors` | 1.34 GB | bf16 (not quantized) |
| `vae.safetensors` | 1.41 GB | bf16 (not quantized) |
| `config.json` | – | With `quantization` block (`bits=8, group_size=64, mode=affine`) |
| `quantization_report.json` | – | Provenance + footprint stats |
| `tokenizer.json` / `vocab.json` | – | Qwen2.5-VL vocabulary |
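
The tensor count for `model.safetensors` can be checked without loading any weights, because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header. The sketch below uses only the standard library; the helper names are hypothetical, and grouping by the last name component assumes quantized layers store tensors named `weight`, `scales`, and `biases`.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Parse the JSON header of a safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional metadata entry, not a tensor
    return header


def count_by_suffix(header):
    """Count tensors by the last component of their name (e.g. weight, scales)."""
    return Counter(name.rsplit(".", 1)[-1] for name in header)
```

For example, `len(read_safetensors_header("model.safetensors"))` gives the total tensor count, and `count_by_suffix(...)` shows how many of those are `scales` or `biases` entries.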
## Architecture (same as the bf16 variant)
See [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16) for the full architecture description.
## License
This MLX port + quantization: **Apache 2.0**.
Underlying weights:
- Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
- Wan2.2 VAE: Apache 2.0 (Alibaba).
- Qwen2.5-VL: Apache 2.0 (Alibaba).
## Citation
```bibtex
@article{fu2026lance,
title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
journal={arXiv preprint arXiv:2605.18678},
year={2026}
}
```
## Links
- **MLX port code:** [`github.com/xocialize/lance-mlx`](https://github.com/xocialize/lance-mlx)
- **bf16 source:** [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16)
- **Standalone VAE:** [`mlx-community/Wan2.2-VAE-Lance-bf16`](https://huggingface.co/mlx-community/Wan2.2-VAE-Lance-bf16)
- **Video specialist (bf16, alpha 8-bit pending):** [`mlx-community/Lance-3B-Video-bf16`](https://huggingface.co/mlx-community/Lance-3B-Video-bf16)