	---
	license: apache-2.0
	language:
	- en
	- zh
	library_name: mlx
	pipeline_tag: image-to-image
	tags:
	- mlx
	- apple-silicon
	- lance
	- bytedance
	- multimodal
	- text-to-image
	- image-editing
	- vqa
	- qwen2.5-vl
	- quantized
	- 8-bit
	base_model: bytedance-research/Lance
	---

	> ⚠️ SUPERSEDED — DO NOT USE. This 8-bit checkpoint produces visibly degraded t2i
	> output (ghost subject + rainbow striped artifacts vs bf16). Kept on HF for historical
	> reproducibility of the May 2026 quantization research record only.
	>
	> What to use instead:
	> - For full-quality `t2i` / `image_edit` / `x2t_image`: [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16) (~15 GB)
	> - For compressed `x2t_image` (VQA) on 8-16 GB Macs: [`mlx-community/Lance-3B-AWQ-INT4`](https://huggingface.co/mlx-community/Lance-3B-AWQ-INT4) (5.65 GB repo, 3.31 GB LLM, 6-9× faster decode)
	> - For image generation on small RAM: no quantized variant is shippable, so use bf16 on a Mac that fits it. Phase 5c-3h showed that the 80% loss of high-frequency (HF) detail is architectural (forward-pass error compounding through Lance's 2,160 evaluations per image), not a quant-scheme problem.

	> 🎓 Quantization research closed (2026-05-26). The May 2026 effort
	> investigated naive groupwise 4/8-bit, DWQ (4-bit UND-only), and AWQ
	> (4-bit + 8-bit, full + UND-only) across multiple configurations. AWQ math
	> is correct per-Linear (Phase 5c-3h empirical confirmation: -28% output MSE
	> average at 8-bit) but per-step quant improvements don't compound through
	> Lance's flow-matching architecture. No quant scheme tested would close
	> the t2i gap; k-quants from llama.cpp would face the same compounding
	> problem. Lance-3B-AWQ-INT4 is the final shipping outcome — VQA only.
	> Full research record: [`xocialize/lance-mlx`](https://github.com/xocialize/lance-mlx)
	> under `notes/phase5n_diagnostics/phase5c3_awq_port/`.

	---


	> 📂 Part of the [Lance MLX collection](https://huggingface.co/collections/mlx-community/lance-mlx-6a0f3cd5648a74f8283fc8a4) on mlx-community.

	# Lance-3B-8bit (MLX, image specialist, 8-bit quantized)

	8-bit groupwise affine quantization of [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16), the image-specialist Lance checkpoint. Produced via mlx-lm's `quantize_model` utility with a per-tower skip predicate (`time_embedder`, `llm2vae`, and `vae_in_proj` kept at bf16 for numerical safety; the bulk LLM weights — attention projections, MLP, embeddings, lm_head — quantized).
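
A skip predicate of the kind described above might look roughly like this. It is a minimal sketch, not the code used to produce this checkpoint: `should_quantize`, `SKIP_SUBSTRINGS`, and the example layer paths are hypothetical names, and the divisibility check simply mirrors the "input_dim 48 ≠ 64*k" auto-skip noted in the component table further down.

```python
# Hypothetical sketch of a per-tower skip predicate; all names are illustrative.
SKIP_SUBSTRINGS = ("time_embedder", "llm2vae", "vae_in_proj")
GROUP_SIZE = 64


def should_quantize(path: str, in_features: int, group_size: int = GROUP_SIZE) -> bool:
    """Return True if the Linear layer at `path` should be quantized."""
    # Keep the numerically sensitive towers at bf16.
    if any(name in path for name in SKIP_SUBSTRINGS):
        return False
    # Groupwise quantization needs the input dimension to split into whole groups.
    return in_features % group_size == 0


if __name__ == "__main__":
    examples = [
        ("layers.0.self_attn.q_proj", 2048),  # assumed path, quantized
        ("time_embedder.proj_in", 2048),      # skipped by name
        ("vae_in_proj.vae2llm", 48),          # skipped by name (and 48 is not a multiple of 64)
    ]
    for path, dim in examples:
        print(f"{path}: quantize={should_quantize(path, dim)}")
```

A function like this could be adapted to the `class_predicate` argument of `mlx.nn.quantize`, which is called with each module's path and the module itself; how `quantize_model` wires its predicate internally is not shown here.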

	## Status

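	🟢 Declared production-ready for image tasks on Apple Silicon at release (2026-05-21). The superseded notice at the top of this card overrides that assessment; the table below records the original release measurements.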

	| Capability | Status | Speedup vs bf16 |
	|---|---|---|
	| t2i (text → image) | ✅ Photorealistic, prompt-aligned | ~2.7× faster (75 s vs 201 s for 768² × 30 steps × CFG=4.0) |
	| image_edit (instruction-based) | ✅ Identity + style preservation | ~2.5× faster expected |
	| x2t_image (image VQA) | ✅ Content-correct | similar / faster |

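	Memory footprint: `model.safetensors` is 6.59 GB on disk (53% of the 12.37 GB bf16 weights). Including the unquantized ViT and VAE files, the repo totals about 9.3 GB. Runtime RAM is ~8–10 GB, comfortable on a 16 GB Mac.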
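
The ratios quoted above can be reproduced with a few lines of arithmetic. The last part is an assumption rather than something this card states: if each group of 64 8-bit weights also stores one 16-bit scale and one 16-bit bias, the expected size relative to bf16 is 8.5/16, which lands close to the measured 53%.

```python
# Figures quoted in this card.
bf16_seconds, int8_seconds = 201.0, 75.0  # t2i at 768 x 768, 30 steps, CFG=4.0
bf16_gb, int8_gb = 12.37, 6.59            # on-disk sizes quoted above

speedup = bf16_seconds / int8_seconds
size_ratio = int8_gb / bf16_gb

# Assumption: one 16-bit scale and one 16-bit bias are stored per group of 64 weights.
bits, group_size, meta_bits = 8, 64, 16 + 16
effective_bits = bits + meta_bits / group_size
predicted_ratio = effective_bits / 16  # relative to 16-bit bf16 weights

print(f"speedup:         {speedup:.2f}x")
print(f"measured ratio:  {size_ratio:.1%}")
print(f"bits per weight: {effective_bits}")
print(f"predicted ratio: {predicted_ratio:.1%}")
```

This works out to a speedup of about 2.68×, a measured size ratio of about 53.3%, and a predicted ratio of about 53.1%, so the on-disk figure is consistent with plain 8-bit groupwise storage plus per-group metadata.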

	## Quality notes vs bf16

	- Photorealism + content fidelity preserved. Cats, dragons, portraits, etc., all generate cleanly.
	- Fine text on generated objects shows slight degradation. E.g. "STOP" on a sign may render as "SNICS" or similar near-miss. The content is otherwise correct (correct color, correct rectangular sign shape, recognizable text-like glyphs).
	- For prompts that don't require legible in-image text, output is visually indistinguishable from bf16 to a casual eye.

	## Quickstart

	```python
	from huggingface_hub import snapshot_download

	# Download the repo (or reuse the local cache) and get the snapshot directory path.
	weights = snapshot_download("mlx-community/Lance-3B-8bit")
	```

	### Text-to-image

	```python
	from lance_mlx.pipeline.t2i import TextToImagePipeline

	# `weights` is the snapshot directory returned by snapshot_download in the Quickstart.
	pipe = TextToImagePipeline.from_pretrained(
	    lance_weights_dir=weights,
	    vae_safetensors=f"{weights}/vae.safetensors",
	)
	image = pipe.generate(
	    "A photorealistic tabby cat in a sunlit window.",
	    height=768, width=768, num_steps=30, cfg_scale=4.0, seed=42,
	)
	image.save("cat.png")
	```

	### Image editing + VQA

	Same API as the bf16 variant — `ImageEditPipeline` and `UnderstandingPipeline` both pick up the `quantization` block in `config.json` automatically via `lance_mlx.model._loader.load_lance_model`.
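
To check what a loader would see, the `quantization` block can be read straight out of `config.json`. The sketch below is only an illustration of that lookup, not the logic inside `load_lance_model`: `parse_quantization` and `read_quantization` are hypothetical helpers, the field names follow the block described in the file table of this card, and the fallback defaults are assumptions.

```python
import json
from pathlib import Path
from typing import Optional


def parse_quantization(config: dict) -> Optional[dict]:
    """Return the quantization settings from a parsed config, or None if unquantized."""
    block = config.get("quantization")
    if block is None:
        return None
    # Field names follow the block described in this card; defaults are assumptions.
    return {
        "bits": int(block["bits"]),
        "group_size": int(block.get("group_size", 64)),
        "mode": block.get("mode", "affine"),
    }


def read_quantization(weights_dir: str) -> Optional[dict]:
    """Load config.json from a snapshot directory and parse its quantization block."""
    config = json.loads((Path(weights_dir) / "config.json").read_text())
    return parse_quantization(config)
```

Calling `read_quantization(weights)` on the snapshot directory from the Quickstart should return the 8-bit, group-size-64, affine settings if the block is laid out as described; on a bf16 checkpoint without the block it would return `None`.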

	## What's quantized vs skipped

	| Component | Quantization | Why |
	|---|---|---|
	| `embed_tokens` (151,936 × 2,048) | ✅ 8-bit | Big, tolerant |
	| `lm_head` (151,936 × 2,048) | ✅ 8-bit | Big, used in AR decode only |
	| 32 layers × `q/k/v/o_proj` (UND) | ✅ 8-bit | Bulk of LLM compute |
	| 32 layers × `q/k/v/o_proj_moe_gen` (GEN) | ✅ 8-bit | Bulk of GEN compute |
	| 32 layers × `mlp.{up,gate,down}_proj` | ✅ 8-bit | Bulk of LLM compute |
	| 32 layers × `mlp_moe_gen.{up,gate,down}` | ✅ 8-bit | Bulk of GEN compute |
	| `time_embedder.proj_in/out` | ❌ bf16 | Timestep info, numerically sensitive |
	| `llm2vae` (flow head, 2048 × 48) | ❌ bf16 | Tiny + critical to flow prediction |
	| `vae_in_proj.vae2llm` (2048 × 48) | ❌ bf16 | Auto-skipped (input_dim 48 ≠ 64*k) |
	| `latent_pos_embed.pos_embed` | ❌ bf16 | Custom param holder, no `to_quantized` |
	| All RMSNorms + QK-norms | ❌ bf16 | F32 / bf16 norm scales preserved |
	| Wan2.2 VAE (encoder + decoder) | ❌ bf16 | Pixel fidelity matters |
	| Qwen2.5-VL ViT | ❌ bf16 | Semantic fidelity matters for x2t |

	Recipe: 8-bit affine, group_size 64. `quantization_report.json` in this repo has full provenance.
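
For readers unfamiliar with the recipe, the idea behind groupwise affine quantization can be sketched in plain Python. This is an illustration only: each group of 64 weights gets its own scale and bias derived from the group's min and max, and weights are stored as integer codes. It is not MLX's implementation (which packs codes and runs on the GPU), and `quantize_group`, `dequantize_group`, and `quantize_row` are hypothetical names.

```python
def quantize_group(weights, bits=8):
    """Affine-quantize one group of floats; returns (codes, scale, bias)."""
    levels = (1 << bits) - 1                 # 255 distinct steps for 8-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0        # fall back to 1.0 for constant groups
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo                  # bias is the group minimum


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats from integer codes."""
    return [c * scale + bias for c in codes]


def quantize_row(row, group_size=64, bits=8):
    """Split a weight row into groups and quantize each one independently."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]
```

The per-weight reconstruction error is bounded by half a quantization step (`scale / 2`) within each group, which is why a group with a wide value range loses more precision than a narrow one. The length check also shows why a layer with an input dimension of 48 cannot use a group size of 64.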

	## Why no Video 8-bit yet

	The video specialist (`Lance_3B_Video`) does not quantize cleanly to 8-bit with this recipe — t2v output collapses to a gray gradient regardless of whether the GEN tower is included or skipped, and finer group_sizes don't help. The video-specialist fine-tune has different weight distributions that affine 8-bit can't capture.

	Reza2kn/lance-quant's findings suggest that DWQ (distilled weight quantization), which relies on calibration data, is the right approach for Lance video at 8-bit and below. That's a Phase 5c project. For now, use [`mlx-community/Lance-3B-Video-bf16`](https://huggingface.co/mlx-community/Lance-3B-Video-bf16) for video tasks.

	## Files in this repo

	| File | Size | Notes |
	|---|---|---|
	| `model.safetensors` | 6.59 GB | Quantized LLM weights (2033 tensors: each Linear becomes weight + scales + biases) |
	| `vit.safetensors` | 1.34 GB | bf16 (not quantized) |
	| `vae.safetensors` | 1.41 GB | bf16 (not quantized) |
	| `config.json` | – | With `quantization` block (`bits=8, group_size=64, mode=affine`) |
	| `quantization_report.json` | – | Provenance + footprint stats |
	| `tokenizer.json` / `vocab.json` | – | Qwen2.5-VL vocabulary |
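
The tensor count and the weight/scales/biases split can be inspected without loading any weights, because a `.safetensors` file starts with an 8-byte little-endian length followed by a JSON header. The sketch below reads only that header using the standard library. `read_safetensors_header` and `count_by_suffix` are hypothetical helpers, and the assumption that tensor names end in `weight`, `scales`, or `biases` follows the description in the table above.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path: str) -> dict:
    """Read only the JSON header of a .safetensors file (no tensor data)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def count_by_suffix(header: dict) -> Counter:
    """Count tensors by the last component of their name (weight, scales, biases, ...)."""
    return Counter(name.rsplit(".", 1)[-1] for name in header)
```

For example, `count_by_suffix(read_safetensors_header(f"{weights}/model.safetensors"))` would give a per-suffix breakdown whose total can be compared against the tensor count listed in the table.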

	## Architecture (same as the bf16 variant)

	See [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16) for the full architecture description.

	## License

	This MLX port + quantization: Apache 2.0.

	Underlying weights:
	- Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
	- Wan2.2 VAE: Apache 2.0 (Alibaba).
	- Qwen2.5-VL: Apache 2.0 (Alibaba).

	## Citation

	```bibtex
	@article{fu2026lance,
	title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
	author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
	journal={arXiv preprint arXiv:2605.18678},
	year={2026}
	}
	```

	## Links

	- MLX port code: [`github.com/xocialize/lance-mlx`](https://github.com/xocialize/lance-mlx)
	- bf16 source: [`mlx-community/Lance-3B-bf16`](https://huggingface.co/mlx-community/Lance-3B-bf16)
	- Standalone VAE: [`mlx-community/Wan2.2-VAE-Lance-bf16`](https://huggingface.co/mlx-community/Wan2.2-VAE-Lance-bf16)
	- Video specialist (bf16, alpha 8-bit pending): [`mlx-community/Lance-3B-Video-bf16`](https://huggingface.co/mlx-community/Lance-3B-Video-bf16)