license: apache-2.0
language:
  - en
  - zh
library_name: mlx
pipeline_tag: image-to-image
tags:
  - mlx
  - apple-silicon
  - lance
  - bytedance
  - multimodal
  - text-to-image
  - image-editing
  - vqa
  - qwen2.5-vl
  - quantized
  - 8-bit
base_model: bytedance-research/Lance

⚠️ SUPERSEDED: DO NOT USE. This 8-bit checkpoint produces visibly degraded t2i output compared with bf16 (ghosted subjects and rainbow-striped artifacts). It is kept on Hugging Face only for historical reproducibility of the May 2026 quantization research record.

What to use instead:

For full-quality t2i / image_edit / x2t_image: mlx-community/Lance-3B-bf16 (~15 GB)

For compressed x2t_image (VQA) on 8-16 GB Macs: mlx-community/Lance-3B-AWQ-INT4 (5.65 GB repo, 3.31 GB LLM, 6-9× faster decode)

For image generation on small RAM: no quantized variant is shippable; use bf16 on a Mac with enough memory to hold it. Phase 5c-3h showed that the 80% loss of high-frequency detail is architectural (forward-pass error compounding through Lance's 2,160 evaluations per image), not a quant-scheme problem.
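The recommendations above can be summarized as a small lookup. This is a minimal sketch: the function name and the `low_memory` flag are hypothetical, and only the repo ids and task names come from this card. No RAM threshold is encoded, because the card does not state one.

```python
# Hypothetical helper; only the repo ids and task names come from this card.
BF16 = "mlx-community/Lance-3B-bf16"
AWQ_INT4 = "mlx-community/Lance-3B-AWQ-INT4"
TASKS = {"t2i", "image_edit", "x2t_image"}


def recommend_checkpoint(task: str, low_memory: bool = False) -> str:
    """Return the repo id this card recommends for a given task."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    # Only VQA (x2t_image) has a shippable compressed variant.
    if task == "x2t_image" and low_memory:
        return AWQ_INT4
    # Image generation and editing have no quantized fallback.
    return BF16


print(recommend_checkpoint("x2t_image", low_memory=True))
print(recommend_checkpoint("t2i", low_memory=True))
```

Note that `t2i` falls back to bf16 even when `low_memory` is set, which mirrors the card's statement that no quantized variant is shippable for image generation.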

🎓 Quantization research closed (2026-05-26). The May 2026 effort investigated naive groupwise 4/8-bit, DWQ (4-bit UND-only), and AWQ (4-bit + 8-bit, full + UND-only) across multiple configurations. AWQ behaves correctly per Linear layer (Phase 5c-3h measured an average 28% reduction in output MSE at 8-bit), but those per-step improvements do not translate into end-to-end gains through Lance's flow-matching architecture. No quant scheme tested would close the t2i gap; k-quants from llama.cpp would face the same compounding problem. Lance-3B-AWQ-INT4 is the final shipping outcome, for VQA only. Full research record: xocialize/lance-mlx under notes/phase5n_diagnostics/phase5c3_awq_port/.

📂 Part of the Lance MLX collection on mlx-community.

Lance-3B-8bit (MLX, image specialist, 8-bit quantized)

8-bit groupwise affine quantization of mlx-community/Lance-3B-bf16, the image-specialist Lance checkpoint. Produced via mlx-lm's quantize_model utility with a per-tower skip predicate: time_embedder, llm2vae, and vae_in_proj are kept at bf16 for numerical safety, while the bulk LLM weights (attention projections, MLP, embeddings, lm_head) are quantized.
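A skip predicate of the kind described above might look like the sketch below. The function name, its signature, and the example module paths are hypothetical; only the three skipped component names and the group size of 64 come from this card. MLX's own predicate hooks receive the module object rather than a bare dimension, so treat this as a standalone illustration and not as the code that produced this checkpoint.

```python
# Substrings of module paths that stay at bf16, per the description above.
SKIP_SUBSTRINGS = ("time_embedder", "llm2vae", "vae_in_proj")


def should_quantize(path: str, input_dim: int, group_size: int = 64) -> bool:
    """Hypothetical predicate: True if the Linear layer at `path` gets quantized."""
    if any(name in path for name in SKIP_SUBSTRINGS):
        return False
    # Group-wise quantization needs the input dimension to split into whole groups.
    return input_dim % group_size == 0


# Example module paths are illustrative, not taken from the real checkpoint.
print(should_quantize("layers.0.self_attn.q_proj", 2048))  # bulk LLM weight
print(should_quantize("time_embedder.proj_in", 2048))      # explicitly skipped
print(should_quantize("some.small_proj", 48))              # 48 is not a multiple of 64
```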

Status

🟢 Marked production-ready for image tasks on Apple Silicon on 2026-05-21; superseded by the banner at the top of this card.

Capability	Status	Speedup vs bf16
t2i (text → image)	✅ Photorealistic, prompt-aligned	~2.7× faster (75 s vs 201 s for 768² × 30 steps × CFG=4.0)
image_edit (instruction-based)	✅ Identity + style preservation	~2.5× faster expected
x2t_image (image VQA)	✅ Content-correct	similar / faster

Memory footprint: 6.59 GB on disk for the quantized LLM weights (53% of the 12.37 GB bf16 equivalent). Runtime RAM ~8–10 GB, comfortable on a 16 GB Mac.
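The 53% figure is close to what an 8-bit, group_size 64 recipe would predict. The sketch below assumes each group of 64 weights stores one 16-bit scale and one 16-bit bias next to its 8-bit codes; that storage layout is an assumption, not something this card states, and the function name is hypothetical.

```python
def effective_bits_per_weight(bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Bits per weight when every group also stores one scale and one bias."""
    return bits + 2 * meta_bits / group_size


eff = effective_bits_per_weight(bits=8, group_size=64)
predicted_ratio = eff / 16        # relative to 16-bit bf16 weights
reported_ratio = 6.59 / 12.37     # sizes quoted in this card

print(f"{eff} bits/weight, predicted {predicted_ratio:.3f}, reported {reported_ratio:.3f}")
```

The match is only approximate, since the components kept at bf16 are not covered by this estimate.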

Quality notes vs bf16

Photorealism + content fidelity preserved. Cats, dragons, portraits, etc., all generate cleanly.
Fine text on generated objects shows slight degradation. E.g. "STOP" on a sign may render as "SNICS" or similar near-miss. The content is otherwise correct (correct color, correct rectangular sign shape, recognizable text-like glyphs).
For prompts that don't require legible in-image text, output is visually indistinguishable from bf16 to a casual eye.

Quickstart

from huggingface_hub import snapshot_download
weights = snapshot_download("mlx-community/Lance-3B-8bit")

Text-to-image

from lance_mlx.pipeline.t2i import TextToImagePipeline

pipe = TextToImagePipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
image = pipe.generate(
    "A photorealistic tabby cat in a sunlit window.",
    height=768, width=768, num_steps=30, cfg_scale=4.0, seed=42,
)
image.save("cat.png")

Image editing + VQA

Same API as the bf16 variant — ImageEditPipeline and UnderstandingPipeline both pick up the quantization block in config.json automatically via lance_mlx.model._loader.load_lance_model.
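The loader's internals are not shown in this card. As a rough illustration of what picking up the quantization block could involve, the sketch below parses a config string with the keys listed in the files table (bits, group_size, mode). The function name and the exact JSON layout are assumptions.

```python
import json


def read_quantization(config_text: str):
    """Return (bits, group_size, mode) from a config.json string, or None if unquantized."""
    block = json.loads(config_text).get("quantization")
    if block is None:
        return None
    # Assumed default: treat a missing mode as affine.
    return block["bits"], block["group_size"], block.get("mode", "affine")


example = '{"quantization": {"bits": 8, "group_size": 64, "mode": "affine"}}'
print(read_quantization(example))
print(read_quantization("{}"))
```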

What's quantized vs skipped

Component	Quantization	Why
`embed_tokens` (151,936 × 2,048)	✅ 8-bit	Big, tolerant
`lm_head` (151,936 × 2,048)	✅ 8-bit	Big, used in AR decode only
32 layers × `q/k/v/o_proj` (UND)	✅ 8-bit	Bulk of LLM compute
32 layers × `q/k/v/o_proj_moe_gen` (GEN)	✅ 8-bit	Bulk of GEN compute
32 layers × `mlp.{up,gate,down}_proj`	✅ 8-bit	Bulk of LLM compute
32 layers × `mlp_moe_gen.{up,gate,down}`	✅ 8-bit	Bulk of GEN compute
`time_embedder.proj_in/out`	❌ bf16	Timestep info, numerically sensitive
`llm2vae` (flow head, 2048 × 48)	❌ bf16	Tiny + critical to flow prediction
`vae_in_proj.vae2llm` (2048 × 48)	❌ bf16	Auto-skipped (input_dim 48 ≠ 64*k)
`latent_pos_embed.pos_embed`	❌ bf16	Custom param holder, no `to_quantized`
All RMSNorms + QK-norms	❌ bf16	F32 / bf16 norm scales preserved
Wan2.2 VAE (encoder + decoder)	❌ bf16	Pixel fidelity matters
Qwen2.5-VL ViT	❌ bf16	Semantic fidelity matters for x2t

Recipe: 8-bit affine, group_size 64. quantization_report.json in this repo has full provenance.
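For readers unfamiliar with the recipe, here is a pure-Python sketch of group-wise affine quantization. It uses the textbook min/max variant, in which each group gets its own scale and bias; MLX's actual kernels may choose these values differently, and all function names here are hypothetical.

```python
def quantize_group(values, bits=8):
    """Affine-quantize one group of floats; returns (codes, scale, bias)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    # Guard against a constant group, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Map integer codes back to approximate floats."""
    return [c * scale + bias for c in codes]


def quantize_row(row, group_size=64, bits=8):
    """Split a row into groups of `group_size` and quantize each one independently."""
    if len(row) % group_size:
        raise ValueError("row length must be a multiple of group_size")
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]


# Round trip on a synthetic 128-element row (two groups of 64).
row = [((i * 37) % 101) / 50.0 - 1.0 for i in range(128)]
groups = quantize_row(row)
recon = [x for codes, s, b in groups for x in dequantize_group(codes, s, b)]
max_err = max(abs(a - b) for a, b in zip(row, recon))
print(len(groups), max_err)
```

The reconstruction error per weight is bounded by half a quantization step, which is why finer groups or more bits reduce per-layer error.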

Why no Video 8-bit yet

The video specialist (Lance_3B_Video) does not quantize cleanly to 8-bit with this recipe — t2v output collapses to a gray gradient regardless of whether the GEN tower is included or skipped, and finer group_sizes don't help. The video-specialist fine-tune has different weight distributions that affine 8-bit can't capture.

Reza2kn/lance-quant's findings suggest DWQ (distilled weight quantization) with calibration is the right approach for Lance video at 8-bit and below. That's a Phase 5c project. For now, use mlx-community/Lance-3B-Video-bf16 for video tasks.

Files in this repo

File	Size	Notes
`model.safetensors`	6.59 GB	Quantized LLM weights (2033 tensors: each Linear becomes weight + scales + biases)
`vit.safetensors`	1.34 GB	bf16 (not quantized)
`vae.safetensors`	1.41 GB	bf16 (not quantized)
`config.json`	–	With `quantization` block (`bits=8, group_size=64, mode=affine`)
`quantization_report.json`	–	Provenance + footprint stats
`tokenizer.json` / `vocab.json`	–	Qwen2.5-VL vocabulary
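After downloading, it can be useful to confirm that the snapshot directory contains the files listed above. The helper below is a hypothetical sketch using only the standard library; the file names come from the table, and the function name is an assumption.

```python
from pathlib import Path

# File names taken from the table above.
EXPECTED_FILES = (
    "model.safetensors",
    "vit.safetensors",
    "vae.safetensors",
    "config.json",
    "quantization_report.json",
    "tokenizer.json",
    "vocab.json",
)


def missing_files(weights_dir) -> list:
    """Return the expected file names that are absent from `weights_dir`."""
    root = Path(weights_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_files(weights)` on the directory returned by `snapshot_download` should give an empty list for a complete download.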

Architecture (same as the bf16 variant)

See mlx-community/Lance-3B-bf16 for the full architecture description.

License

This MLX port + quantization: Apache 2.0.

Underlying weights:

Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
Wan2.2 VAE: Apache 2.0 (Alibaba).
Qwen2.5-VL: Apache 2.0 (Alibaba).

Citation

@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}
