license: apache-2.0
language:
  - en
  - zh
library_name: mlx
pipeline_tag: image-to-image
tags:
  - mlx
  - apple-silicon
  - lance
  - bytedance
  - multimodal
  - text-to-image
  - image-editing
  - vqa
  - qwen2.5-vl
  - quantized
  - 8-bit
base_model: bytedance-research/Lance

⚠️ SUPERSEDED: DO NOT USE. This 8-bit checkpoint produces visibly degraded t2i output (a ghost subject plus rainbow-striped artifacts when compared with bf16). It is kept on Hugging Face only for historical reproducibility of the May 2026 quantization research record.

What to use instead:

  • For full-quality t2i / image_edit / x2t_image: mlx-community/Lance-3B-bf16 (~15 GB)
  • For compressed x2t_image (VQA) on 8-16 GB Macs: mlx-community/Lance-3B-AWQ-INT4 (5.65 GB repo, 3.31 GB LLM, 6-9× faster decode)
  • For image generation on small RAM: no quantized variant is shippable, so use bf16 on a Mac that fits it. Phase 5c-3h showed the 80% HF detail loss is architectural (forward-pass error compounding through Lance's 2,160 evaluations per image), not a quant-scheme problem.

🎓 Quantization research closed (2026-05-26). The May 2026 effort investigated naive groupwise 4/8-bit, DWQ (4-bit UND-only), and AWQ (4-bit + 8-bit, full + UND-only) across multiple configurations. The AWQ math is correct per Linear layer (Phase 5c-3h empirical confirmation: -28% output MSE on average at 8-bit), but the per-step quantization improvements don't compound through Lance's flow-matching architecture. No quant scheme tested would close the t2i gap; k-quants from llama.cpp would face the same compounding problem. Lance-3B-AWQ-INT4 is the final shipping outcome, for VQA only. Full research record: xocialize/lance-mlx under notes/phase5n_diagnostics/phase5c3_awq_port/.


📂 Part of the Lance MLX collection on mlx-community.

Lance-3B-8bit (MLX, image specialist, 8-bit quantized)

8-bit groupwise affine quantization of mlx-community/Lance-3B-bf16, the image-specialist Lance checkpoint. Produced via mlx-lm's quantize_model utility with a per-tower skip predicate: time_embedder, llm2vae, and vae_in_proj are kept at bf16 for numerical safety, while the bulk LLM weights (attention projections, MLP, embeddings, lm_head) are quantized.

Status

🟢 Production-ready for image tasks on Apple Silicon as of 2026-05-21.

| Capability | Status | Speedup vs bf16 |
|---|---|---|
| t2i (text → image) | ✅ Photorealistic, prompt-aligned | ~2.7× faster (75 s vs 201 s for 768² × 30 steps × CFG=4.0) |
| image_edit (instruction-based) | ✅ Identity + style preservation | ~2.5× faster expected |
| x2t_image (image VQA) | ✅ Content-correct | similar / faster |

Memory footprint: 6.59 GB on disk (53% of the bf16 12.37 GB). Runtime RAM ~8–10 GB, comfortable on a 16 GB Mac.
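The 53% figure is close to what 8-bit groupwise storage predicts. The following is a back-of-the-envelope sketch, assuming each group of 64 weights carries one 16-bit scale and one 16-bit bias (an assumption about the storage layout, not a number taken from quantization_report.json). The helper name bits_per_weight is hypothetical; the sizes and timings are the ones quoted in this README.

```python
# Rough storage estimate for groupwise affine quantization.
# Assumption: one 16-bit scale and one 16-bit bias per group of weights.
def bits_per_weight(bits: int, group_size: int, meta_bits: int = 16) -> float:
    return bits + 2 * meta_bits / group_size

bpw = bits_per_weight(bits=8, group_size=64)  # 8 + 32/64 = 8.5
predicted_ratio = bpw / 16                    # relative to bf16 (16 bits per weight)

# Figures quoted in this README.
measured_ratio = 6.59 / 12.37                 # quantized vs bf16 size on disk
t2i_speedup = 201 / 75                        # bf16 seconds / 8-bit seconds

print(f"bits per weight:      {bpw}")
print(f"predicted size ratio: {predicted_ratio:.3f}")
print(f"measured size ratio:  {measured_ratio:.3f}")
print(f"t2i speedup:          {t2i_speedup:.2f}x")
```

The estimate is only approximate, since the components left at bf16 are not covered by the 8.5 bits-per-weight figure.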

Quality notes vs bf16

  • Photorealism + content fidelity preserved. Cats, dragons, portraits, etc., all generate cleanly.
  • Fine text on generated objects shows slight degradation. E.g. "STOP" on a sign may render as "SNICS" or similar near-miss. The content is otherwise correct (correct color, correct rectangular sign shape, recognizable text-like glyphs).
  • For prompts that don't require legible in-image text, output is visually indistinguishable from bf16 to a casual eye.

Quickstart

from huggingface_hub import snapshot_download
weights = snapshot_download("mlx-community/Lance-3B-8bit")

Text-to-image

from lance_mlx.pipeline.t2i import TextToImagePipeline

pipe = TextToImagePipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
image = pipe.generate(
    "A photorealistic tabby cat in a sunlit window.",
    height=768, width=768, num_steps=30, cfg_scale=4.0, seed=42,
)
image.save("cat.png")

Image editing + VQA

Same API as the bf16 variant: ImageEditPipeline and UnderstandingPipeline both pick up the quantization block in config.json automatically via lance_mlx.model._loader.load_lance_model.
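To check which recipe a downloaded checkpoint declares, config.json can be read directly. This is a minimal sketch: the top-level key name "quantization" is an assumption based on the description above, and read_quantization is a hypothetical helper, not part of lance_mlx.

```python
import json
from pathlib import Path
from typing import Optional


def read_quantization(weights_dir: str) -> Optional[dict]:
    """Return the quantization block from config.json, or None if absent.

    Assumes the block lives under a top-level "quantization" key.
    """
    config_path = Path(weights_dir) / "config.json"
    config = json.loads(config_path.read_text())
    return config.get("quantization")
```

Calling read_quantization(weights) on the directory returned by snapshot_download would be expected to show the bits, group_size, and mode values listed for config.json in the file table, provided the key name matches.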

What's quantized vs skipped

| Component | Quantization | Why |
|---|---|---|
| embed_tokens (151,936 × 2,048) | ✅ 8-bit | Big, tolerant |
| lm_head (151,936 × 2,048) | ✅ 8-bit | Big, used in AR decode only |
| 32 layers × q/k/v/o_proj (UND) | ✅ 8-bit | Bulk of LLM compute |
| 32 layers × q/k/v/o_proj_moe_gen (GEN) | ✅ 8-bit | Bulk of GEN compute |
| 32 layers × mlp.{up,gate,down}_proj | ✅ 8-bit | Bulk of LLM compute |
| 32 layers × mlp_moe_gen.{up,gate,down} | ✅ 8-bit | Bulk of GEN compute |
| time_embedder.proj_in/out | ❌ bf16 | Timestep info, numerically sensitive |
| llm2vae (flow head, 2048 × 48) | ❌ bf16 | Tiny + critical to flow prediction |
| vae_in_proj.vae2llm (2048 × 48) | ❌ bf16 | Auto-skipped (input_dim 48 ≠ 64*k) |
| latent_pos_embed.pos_embed | ❌ bf16 | Custom param holder, no to_quantized |
| All RMSNorms + QK-norms | ❌ bf16 | F32 / bf16 norm scales preserved |
| Wan2.2 VAE (encoder + decoder) | ❌ bf16 | Pixel fidelity matters |
| Qwen2.5-VL ViT | ❌ bf16 | Semantic fidelity matters for x2t |
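The skip rules in the table can be expressed as a small predicate. The sketch below is illustrative only: should_quantize and SKIP_SUBSTRINGS are hypothetical names, the substrings are taken from the component names in the table, and the predicate actually used to produce this checkpoint may differ. A function of this shape could be adapted into the class_predicate callback of mlx.nn.quantize, which receives a module path and the module itself.

```python
# Hypothetical skip list, derived from the bf16 rows of the table above.
SKIP_SUBSTRINGS = (
    "time_embedder",
    "llm2vae",
    "vae_in_proj",
    "latent_pos_embed",
    "norm",
)


def should_quantize(path: str, input_dim: int, group_size: int = 64) -> bool:
    """Decide whether the layer at `path` should be quantized.

    `path` is a dotted module path; the exact paths in the real model
    are not documented here, so the examples below are made up.
    """
    if any(name in path for name in SKIP_SUBSTRINGS):
        return False
    # Groupwise quantization needs the input dimension to split evenly
    # into groups, which is why a 48-wide input is skipped automatically.
    return input_dim % group_size == 0


print(should_quantize("layers.0.self_attn.q_proj", 2048))
print(should_quantize("time_embedder.proj_in", 2048))
```

The VAE and ViT rows need no rule here because they ship in separate files (vae.safetensors, vit.safetensors) and are never passed through the quantizer in this sketch.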

Recipe: 8-bit affine, group_size 64. quantization_report.json in this repo has full provenance.
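For readers unfamiliar with the recipe: groupwise affine quantization stores, for every group of 64 consecutive weights, integer codes plus one scale and one bias. The pure-Python sketch below illustrates the idea on a single row. It is a teaching aid with hypothetical function names, not the MLX kernel, and it ignores bit-packing and the rounding of scale and bias to bf16.

```python
def quantize_group(values, bits=8):
    """Affine-quantize one group; returns (codes, scale, bias)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1              # 255 for 8-bit
    scale = (hi - lo) / levels or 1.0     # fall back to 1.0 for constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def quantize_row(row, group_size=64, bits=8):
    """Split a row into groups and quantize each one independently."""
    if len(row) % group_size != 0:
        raise ValueError(f"row length {len(row)} is not a multiple of {group_size}")
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Reconstruct approximate weights: w is roughly code * scale + bias."""
    return [c * scale + bias for codes, scale, bias in groups for c in codes]
```

Each reconstructed weight differs from the original by at most half a quantization step (scale / 2) within its group, so groups with a narrow value range are reproduced more precisely than groups containing outliers.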

Why no Video 8-bit yet

The video specialist (Lance_3B_Video) does not quantize cleanly to 8-bit with this recipe: t2v output collapses to a gray gradient regardless of whether the GEN tower is included or skipped, and finer group_sizes don't help. The video-specialist fine-tune has different weight distributions that affine 8-bit can't capture.

Reza2kn/lance-quant's findings suggest DWQ (distilled weight quantization) with calibration is the right approach for Lance video at 8-bit and below. That's a Phase 5c project. For now, use mlx-community/Lance-3B-Video-bf16 for video tasks.

Files in this repo

| File | Size | Notes |
|---|---|---|
| model.safetensors | 6.59 GB | Quantized LLM weights (2033 tensors: each Linear becomes weight + scales + biases) |
| vit.safetensors | 1.34 GB | bf16 (not quantized) |
| vae.safetensors | 1.41 GB | bf16 (not quantized) |
| config.json | – | With quantization block (bits=8, group_size=64, mode=affine) |
| quantization_report.json | – | Provenance + footprint stats |
| tokenizer.json / vocab.json | – | Qwen2.5-VL vocabulary |

Architecture (same as the bf16 variant)

See mlx-community/Lance-3B-bf16 for the full architecture description.

License

This MLX port + quantization: Apache 2.0.

Underlying weights:

  • Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
  • Wan2.2 VAE: Apache 2.0 (Alibaba).
  • Qwen2.5-VL: Apache 2.0 (Alibaba).

Citation

@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}

Links