📂 Part of the Lance MLX collection on mlx-community.

Lance-3B-Video-bf16 (MLX, video specialist) — 🚧 ALPHA

MLX port of ByteDance Intelligent Creation Lab's Lance — the video-specialist Lance_3B_Video checkpoint, converted to bf16 for Apple Silicon. ~6.44 B LLM parameters + 669 M Qwen2.5-VL ViT bundled, with the 126,976-entry latent_pos_embed table needed for video-scale latent grids.

⚠️ This is an alpha release. t2v is production-quality through 768²×25f (n_lat=16,128) after the Phase 5m CFG-renorm fix (v0.5.2), verified across two prompts (panda surfing, bus + Big Ben). At n_lat ≥ ~30k (768²×49f, 480×848×121f) Phase 5m partially closes the original "pure noise" failure to a milder "structured-but-degraded with mesh artifacts" failure — the model attempts the scene but the VAE outputs colored geometric tiles overlaid on it. See Status below.

Lance is ByteDance's 3B-active unified multimodal model (paper, code, HF original). This is not Lance/LanceDB, the columnar data format.

Status

🟡 Alpha — t2v is production-quality through 768²×25f (n_lat=16,128) after the Phase 5m fix; n_lat ≥ ~30k produces structured-but-degraded output with mesh artifacts (improvement over pre-fix pure noise, but not usable); understanding pipelines unvalidated.

Capability	Status	Notes
t2v at 256×256 × ≤17 frames	🟢 Production	Red panda surfing demo shows real temporal motion. ~33 s/clip on M5 Max.
t2v at 512×512 × 17 frames (n_lat ≤ 5,120)	🟢 Production	Painterly aesthetic (this checkpoint's training-time style).
t2v at 640×640 × 17 frames (n_lat ≤ 8,000)	🟢 Production	New scale validated under Phase 5m.
t2v at 768×768 × ≤13 frames (n_lat ≤ 9,216)	🟢 Production	Painterly; legacy baseline.
t2v at 768×768 × 17 frames (n_lat = 11,520)	🟢 Production (Phase 5m)	CFG-renorm fix in v0.5.2 — closes the silent quality regression. Pass `cfg_renorm_type="global"` to restore legacy default.
t2v at 768×768 × 25 frames (n_lat = 16,128)	🟢 Production (Phase 5m+)	Verified post-fix with clean diagnostic prompt (bus + Big Ben). Quality equivalent-or-better than 17f; the n_lat → quality relationship is stochastic (seed × scale), not a monotonic degradation curve.
t2v at 768×768 × 33-41 frames (n_lat = 21k–26k)	❓ Untested with Phase 5m	Gap in the empirical sweep. Likely in-envelope based on 25f result but unverified.
t2v at 768×768 × 49 frames (n_lat = 29,952)	❌ Structured-but-degraded	Manual verification 2026-05-23: Phase 5m partially closes the pre-fix pure-noise collapse to a milder failure — the model attempts the scene (Big Ben silhouette barely visible) but the VAE produces colored geometric mesh artifacts overlaid throughout. Numerical signature: final std=0.623 vs ~0.88 for clean runs — channel renorm clamps too aggressively at late timesteps once n_lat reaches 30k, pushing latents outside the VAE's trained distribution. ~78 min wall-clock + 84.6 GB peak memory. Tracked as issue #1.
t2v at Lance reference (480×848 × 121f, n_lat ≈ 49 k)	❌ Same regime as above	Untested directly at this exact dimension but expected to fall in the same degraded-mesh-artifact regime as 768²×49f.
x2t_video (video VQA / captioning)	🟡 Implemented, not validated	Pipeline lands in lance-mlx but hasn't been compared against Phase 0 oracle.
video_edit (instruction-based)	🟡 Implemented, not validated	Direct fusion of t2v + image_edit. Will only be as good as t2v at the chosen scale.

For production-quality image tasks (t2i, image_edit, x2t_image), use the sibling repo mlx-community/Lance-3B-bf16 — it's fully validated.

Hardware envelope (`memory_mode`, 2026-06-02)

The lance-mlx source repo's memory_mode knob (auto / parallel / relay) brings bf16 video generation within reach of 8–16 GB Apple Silicon Macs:

RAM	Mode	t2v / video_edit	Notes
8–16 GB	`relay` (auto-resolved)	✅ Long t2v clips fit (e.g. 256²×61f in ~8.7 GB)	Single-shot per pipeline load — re-prefill reloads the UND tower. Tiled spatial+temporal VAE decode keeps the decode peak roughly flat in frame count.
24 GB+	`parallel` (auto-resolved)	✅ Reusable pipeline across calls	All components resident; classic behavior.

relay produces byte-identical output to parallel (frame MD5-verified on real Lance-3B-Video-bf16) — it sheds the UND tower after prefill and frees the GEN tower before VAE decode, so peak memory ≈ heaviest single phase rather than the sum of all three. The tiled VAE decode (tile_vae=True, plus vae_temporal_tile for long clips) bounds the decode transient, so long t2v becomes loop-bound rather than decode-bound. Default auto resolves by mx.device_info()'s recommended working-set size with the split at ~18 GiB. The same envelope applies to the sibling mlx-community/Lance-3B-bf16 for image tasks.
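The auto resolution described above amounts to a single threshold check. The sketch below is an illustration, not the lance-mlx implementation: the name `resolve_memory_mode`, the handling of a budget exactly at the split, and the example budgets are all assumptions. On a real machine the byte count would come from MLX's device info; here it is passed in directly so the snippet runs without MLX.

```python
GIB = 1024 ** 3
SPLIT_BYTES = 18 * GIB  # the "~18 GiB" split quoted above; the exact value is approximate


def resolve_memory_mode(recommended_working_set_bytes: int, requested: str = "auto") -> str:
    """Hypothetical helper: map a working-set budget to 'relay' or 'parallel'."""
    if requested in ("relay", "parallel"):
        return requested  # an explicit choice wins over auto-resolution
    if requested != "auto":
        raise ValueError(f"unknown memory_mode: {requested!r}")
    # Assumption: budgets below the split resolve to relay, everything else to parallel.
    return "relay" if recommended_working_set_bytes < SPLIT_BYTES else "parallel"


# Illustrative budgets only (not measured values).
for budget_gib in (10, 17, 18, 96):
    print(budget_gib, "GiB ->", resolve_memory_mode(budget_gib * GIB))
```

Because the recommended working set is smaller than installed RAM, a split near 18 GiB is consistent with the table above, where 8–16 GB machines resolve to relay and 24 GB+ machines resolve to parallel.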

Within the n_lat ≤ 16,128 production envelope above, relay fits every validated configuration in 16 GB; the larger-than-envelope configs (n_lat ≥ ~30k) are still gated by the structured-mesh-artifact issue documented below, not by memory.

Lossless streaming VAE decode (`lossless_decode`, 2026-06-05)

generate(..., lossless_decode=True) (default since 2026-06-05) uses a bit-identical streaming VAE decode whose memory is flat in frame count. It streams temporally with a causal cache (the Wan2.2 VAE is causal in time) and composes that with spatial halo-tile + crop. A 50-case bit-identity test in lance-mlx, with a negative control, verifies the guarantee (max|Δ|=0 vs whole-sequence dec(z)). Measured on the real Lance-3B-Video-bf16 with ri_phys (the true OS-committed footprint, which supersedes the older mx.get_peak_memory() quotes that under-report by ~2×):

Config	whole `dec(z)`	lossless streaming	lossy blend	16 GB
256² × 49f	12.18 GB	7.99 GB	—	✅ lossless (lighter + exact)
256² × 121f	15.41 GB	8.05 GB	12.47 GB	✅ lossless (lighter + exact)
512² × 61f	OOM	12.64 GB	OOM (>20 GB)	✅ lossless (only path that fits)
768² video (≥13f)	OOM	>~21 GB (lower bound)	13.2–16.3 GB (+~3 GB swap)	lossy on 16 GB (lossless needs 24 GB+, predicted)

The temporal streaming win widens with clip length: at 256² the lossless decode goes from 7.99 GB at 49f to 8.05 GB at 121f, essentially flat. Only 768² video needs the lossy fallback on 16 GB: pass lossless_decode=False to keep the trapezoidal-blend tiling (~1.5–4.8 / 255 off the reference). For bit-identical 768² video, a 24 GB+ Mac is needed (predicted from the >~21 GB lower bound, not directly measured).
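To see why a causal-in-time decoder can be streamed with no loss, the toy below replaces the Wan2.2 VAE with a single causal temporal convolution in NumPy. It is a sketch under stated assumptions, not lance-mlx code: every function name is hypothetical, the kernel and tensor sizes are arbitrary, and the spatial halo-tile + crop step is not modelled. It shows that carrying the last k-1 frames as a cache reproduces the whole-sequence result exactly, while dropping the cache (the negative control) does not.

```python
import numpy as np


def causal_conv_time(x_padded: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Causal temporal conv on input already left-padded with len(w) - 1 frames."""
    k = len(w)
    t_out = x_padded.shape[0] - (k - 1)
    out = np.zeros((t_out,) + x_padded.shape[1:], dtype=x_padded.dtype)
    # Fixed accumulation order, so every output frame sees identical arithmetic.
    for j in range(k):
        out += w[j] * x_padded[j : j + t_out]
    return out


def decode_whole(z: np.ndarray, w: np.ndarray) -> np.ndarray:
    pad = np.zeros((len(w) - 1,) + z.shape[1:], dtype=z.dtype)
    return causal_conv_time(np.concatenate([pad, z]), w)


def decode_streaming(z: np.ndarray, w: np.ndarray, chunk: int, keep_cache: bool = True) -> np.ndarray:
    k = len(w)
    zeros = np.zeros((k - 1,) + z.shape[1:], dtype=z.dtype)
    cache, outs = zeros, []
    for start in range(0, z.shape[0], chunk):
        block = np.concatenate([cache, z[start : start + chunk]])
        outs.append(causal_conv_time(block, w))
        # Keep only the k - 1 most recent frames; resetting them is the negative control.
        cache = block[block.shape[0] - (k - 1):] if keep_cache else zeros
    return np.concatenate(outs)


rng = np.random.default_rng(0)
z = rng.standard_normal((13, 4, 4)).astype(np.float32)  # (latent frames, H, W), toy sizes
w = np.array([0.25, 0.5, 0.25], dtype=np.float32)

reference = decode_whole(z, w)
streamed = decode_streaming(z, w, chunk=2)
no_cache = decode_streaming(z, w, chunk=2, keep_cache=False)
print("max|delta| with cache:   ", float(np.max(np.abs(streamed - reference))))
print("max|delta| without cache:", float(np.max(np.abs(no_cache - reference))))
```

Peak working memory in the streamed path depends on the chunk size plus the cache, not on the clip length, which is the property behind the nearly flat 49f vs 121f numbers above.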

Full envelope, methodology, and raw numbers in the source repo's LIMITS.md.

There is no AWQ-INT4 quantized variant for video — mlx-community/Lance-3B-AWQ-INT4 ships for x2t_image VQA only and is not applicable to t2v / video_edit / x2t_video. For video on small Macs, bf16 + memory_mode=relay (+ lossless_decode=False for 768² video) is the path.

Why a separate "Video" checkpoint?

ByteDance ships two variants of Lance that differ in fine-tuning (NOT just latent_pos_embed size):

Lance_3B — image specialist. Crystal-clear photorealistic t2i.
Lance_3B_Video — video specialist. Same architecture, further fine-tuned on video data. Native aesthetic is painterly (verified by per-tensor diff: _moe_gen QK-norms differ by 0.5–0.85 in 6+ layers; lm_head and embed_tokens are byte-identical).

This checkpoint also bundles the Qwen2.5-VL ViT for video-understanding tasks, with the larger 126,976-entry latent_pos_embed table that addresses video-resolution token grids.

Quickstart

Install from the lance-mlx source repo:

git clone https://github.com/xocialize/lance-mlx
cd lance-mlx && uv sync

Download this checkpoint:

from huggingface_hub import snapshot_download
weights = snapshot_download("mlx-community/Lance-3B-Video-bf16")

Text-to-video (recommended scale)

from lance_mlx.pipeline.t2v import TextToVideoPipeline

pipe = TextToVideoPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
frames = pipe.generate(
    "A red panda surfing on a sunny wave.",
    num_frames=16, height=256, width=256,
    num_steps=30, cfg_scale=4.0, seed=42,
)
# frames is np.ndarray of shape (T_decoded, H, W, 3) uint8

Encode to MP4 with imageio:

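import imageio.v2 as imageio  # v2 API; MP4 output needs the imageio-ffmpeg backend

# Write the (T, H, W, 3) uint8 frames as H.264 video at 12 fps.
with imageio.get_writer("out.mp4", fps=12, codec="libx264") as writer:
    for f in frames:
        writer.append_data(f)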

Video understanding (alpha)

from lance_mlx.pipeline.understanding import UnderstandingPipeline

pipe = UnderstandingPipeline.from_pretrained(
    lance_weights_dir=weights,
    vit_safetensors=f"{weights}/vit.safetensors",
)
answer = pipe.generate_video(
    video="my_video.mp4",
    question="Describe what happens in this video.",
    num_sample_frames=16, target_h=224, target_w=224,
    max_new_tokens=256, prompt_style="lance",
)
print(answer)

⚠️ Unvalidated against Phase 0 oracle. Treat answers as exploratory.

Phase 5m fix — silent quality regression at n_lat ≈ 11,520 RESOLVED (v0.5.2)

The "global" CFG-renorm cap was computing a single scalar L2 over the entire velocity tensor. At higher n_lat (≈ 2× the production baseline) the L2 sum spans roughly twice as many elements, so the same cap silently over-suppressed high-frequency detail — composition + identity correct, but textures and sky gradients degraded.

Fix (default since v0.5.2): cfg_renorm_type="channel" computes per-channel L2 separately, so pathological channels clamp without dragging the aggregate signal down. Detail returns at high n_lat without regressing small scales.

Evidence (768²×17f, seed=43, cfg=4.0): "global" final std 0.907 (over-suppressed), "none" final std 1.112 (uncapped + clean), "channel" final std 0.900 (capped per-channel + visually matches "none"). V0 safety A/B at 768²×13f confirms no small-scale regression.

Pass cfg_renorm_type="global" to restore the legacy default.
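The difference between the two renorm modes is easiest to see on a tiny tensor. The NumPy sketch below is illustrative only and is not the lance-mlx code: the function name `cfg_with_renorm`, the choice of the conditional prediction's norm as the cap, the reduction axis used for "channel", and the `eps` value are all assumptions made for the example. It constructs a case where one channel is blown up by guidance and shows that a single global scalar shrinks the healthy channel too, while per-channel norms clamp only the outlier.

```python
import numpy as np


def cfg_with_renorm(v_cond, v_uncond, cfg_scale, renorm_type="channel", eps=1e-8):
    """Hypothetical sketch: classifier-free guidance followed by a norm cap.

    v_cond, v_uncond: (n_lat, channels) velocity predictions.
    """
    v_cfg = v_uncond + cfg_scale * (v_cond - v_uncond)
    if renorm_type == "none":
        return v_cfg
    if renorm_type == "global":
        axis = None  # one scalar L2 over the whole tensor
    elif renorm_type == "channel":
        axis = 0     # assumption: one L2 per channel, reduced over tokens
    else:
        raise ValueError(f"unknown renorm_type: {renorm_type!r}")
    norm_cond = np.sqrt(np.sum(v_cond * v_cond, axis=axis, keepdims=True))
    norm_cfg = np.sqrt(np.sum(v_cfg * v_cfg, axis=axis, keepdims=True))
    # Cap only (assumed): shrink when the guided norm exceeds the conditional norm.
    scale = np.minimum(1.0, norm_cond / (norm_cfg + eps))
    return v_cfg * scale


# Channel 0 is unaffected by guidance; channel 1 is amplified 4x by it.
v_cond = np.ones((4, 2))
v_uncond = v_cond.copy()
v_uncond[:, 1] = 0.0

out_global = cfg_with_renorm(v_cond, v_uncond, 4.0, "global")
out_channel = cfg_with_renorm(v_cond, v_uncond, 4.0, "channel")
print("global :", out_global[0])   # both channels scaled by the same factor
print("channel:", out_channel[0])  # only the amplified channel is pulled back
```

In this toy case the global mode scales the untouched channel by sqrt(8/68) ≈ 0.34, whereas the channel mode leaves it at roughly 1 and only caps the amplified one.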

Known issue: structured-but-degraded mesh artifacts at n_lat ≥ ~30k

Before Phase 5m, Lance_3B_Video t2v collapsed to pure random noise at very high latent counts. After Phase 5m the failure mode is milder but still unusable: the model attempts the scene (recognizable silhouettes are barely visible), but the VAE outputs colored geometric mesh tiles overlaid throughout.

Bisection on Phase 5m defaults (cfg_renorm_type="channel"):

T_frames   n_lat   result
       1   2,304   coherent  (same as t2i)
       5   4,608   coherent
       9   6,912   coherent (painterly)
      13   9,216   coherent (painterly, mild temporal drift)
      17  11,520   coherent ← Phase 5m fix restored detail
      25  16,128   coherent ← Phase 5m+ verified across two prompts
      33  20,736   untested
      41  25,344   untested
      49  29,952   structured-but-degraded ← manual verification 2026-05-23
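The n_lat values for the tested rows above follow a simple pattern, which helps when budgeting a new resolution or clip length. The helper below is inferred from those numbers rather than taken from lance-mlx: the name `n_lat`, the 16× spatial stride, and the 4× temporal stride with the first frame kept separate are assumptions that happen to reproduce every validated configuration quoted in this card.

```python
def n_lat(height: int, width: int, num_frames: int,
          spatial_stride: int = 16, temporal_stride: int = 4) -> int:
    """Latent token count; the strides are assumptions inferred from the table."""
    latent_frames = (num_frames - 1) // temporal_stride + 1
    return (height // spatial_stride) * (width // spatial_stride) * latent_frames


# Configurations quoted in this card.
for h, w, t in [(768, 768, 1), (768, 768, 17), (768, 768, 25), (768, 768, 49),
                (512, 512, 17), (640, 640, 17), (480, 848, 121)]:
    print(f"{h}x{w} x {t}f -> n_lat = {n_lat(h, w, t):,}")
```

Under these assumptions the 768²×25f production ceiling is 16,128 tokens and the Lance reference shape (480×848×121f) lands near 49k, matching the figures in the Status table.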

Numerical signature of the degraded regime: final std=0.623 (49f) vs ~0.88 (clean runs). Channel renorm clamps too aggressively at late timesteps once n_lat reaches ~30k, pushing latents outside the VAE's trained distribution. The mesh-tile pattern is the VAE's response to out-of-distribution latents — not random noise but a low-rank geometric approximation.

Open candidates for a future Phase 5n / issue #1 fix:

Per-channel renorm threshold that scales with n_lat (currently constant)
Alternative late-timestep clamping (e.g. cfg_interval=[0.4, 1.0] to disable CFG entirely in the last steps)
Investigating whether VAE decoder can be retrained on Phase-5m-style latents (longer-term)
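As a rough picture of the cfg_interval candidate in the list above, the sketch below gates guidance by timestep and counts the resulting forward passes. It is speculative and not an existing lance-mlx option: the helper names, the convention that t runs from 1.0 (noise) down toward 0 (data), and the evenly spaced schedule are assumptions. It only illustrates what "disable CFG in the last steps" would mean for a 30-step run.

```python
def cfg_active(t: float, cfg_interval=(0.4, 1.0)) -> bool:
    """True when guidance should be applied at timestep t (hypothetical gate)."""
    lo, hi = cfg_interval
    return lo <= t <= hi


def count_forward_passes(num_steps: int, cfg_interval=(0.4, 1.0)):
    # Assumption: evenly spaced timesteps t = 1.0, ..., 1/num_steps.
    timesteps = [k / num_steps for k in range(num_steps, 0, -1)]
    guided = sum(cfg_active(t, cfg_interval) for t in timesteps)
    # Guided steps run cond + uncond; unguided steps run cond only.
    passes = 2 * guided + (num_steps - guided)
    return guided, passes


print(count_forward_passes(30, (0.4, 1.0)))  # guidance off for the late steps
print(count_forward_passes(30, (0.0, 1.0)))  # guidance on everywhere
```

Whether removing guidance at late timesteps actually avoids the mesh artifacts is an open question; the count just shows the side effect of fewer forward passes.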

The bug does not affect:

Image tasks (use mlx-community/Lance-3B-bf16).
t2v through 768² × 25f with Phase 5m defaults.
The model checkpoint itself — same weights produce coherent images at any resolution.

Tracked at github.com/xocialize/lance-mlx/issues/1.

Performance (M5 Max 128 GB)

Task	Configuration	Wall-clock
t2v	256² × 16f, 30 steps, CFG=4.0	~33 s
t2v	512² × 16f, 30 steps, CFG=4.0	~60 s
t2v	768² × 13f, 30 steps, CFG=4.0	~145 s

CFG doubles the forward cost since cond + uncond run sequentially. KV cache for the text prefix is a Phase 5 follow-up.

Files in this repo

File	Size	Purpose
`model.safetensors`	12.87 GB	LLM weights (1021 tensors, both UND + GEN towers, with 126,976-entry latent_pos_embed)
`vit.safetensors`	1.34 GB	Qwen2.5-VL ViT (semantic encoder for x2t_video)
`vae.safetensors`	1.41 GB	Lance's bundled Wan2.2 VAE (also available standalone as `mlx-community/Wan2.2-VAE-Lance-bf16`)
`config.json`	–	`Qwen2_5_VLForConditionalGeneration` config
`conversion_report.json`	–	Provenance
`tokenizer.json` / `vocab.json`	–	Qwen2.5-VL vocabulary
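The two largest file sizes can be sanity-checked from the parameter counts quoted under Provenance, since bf16 stores each parameter in 2 bytes. A minimal check, assuming decimal gigabytes and ignoring safetensors header overhead:

```python
BYTES_PER_BF16 = 2

# Parameter counts as quoted in the Provenance section.
param_counts = {"model.safetensors (LLM)": 6.437e9, "vit.safetensors (ViT)": 0.669e9}

sizes_gb = {name: round(n * BYTES_PER_BF16 / 1e9, 2) for name, n in param_counts.items()}
for name, gb in sizes_gb.items():
    print(f"{name}: {gb} GB")
```

Both results agree with the sizes listed in the table above.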

Provenance

Source: bytedance-research/Lance/Lance_3B_Video/model.safetensors (1411 tensors including bundled ViT; 6.437 B LLM + 0.669 B ViT params). Converted via scripts/02_convert.py. The bundled ViT is extracted to a sibling vit.safetensors with the vit_model. prefix stripped, matching the layout convention of the image-specialist repo.
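The ViT extraction described above is a prefix split over tensor names. The sketch below shows the idea on a plain dictionary; it is not the code in scripts/02_convert.py, the function name `split_bundled_vit` is hypothetical, and the example keys are made up for illustration. In a real conversion the dictionary would hold arrays loaded from the source safetensors file.

```python
def split_bundled_vit(tensors: dict, prefix: str = "vit_model."):
    """Split a name->tensor mapping into (llm, vit), stripping the ViT prefix."""
    llm, vit = {}, {}
    for name, value in tensors.items():
        if name.startswith(prefix):
            vit[name[len(prefix):]] = value  # drop the prefix for the sibling file
        else:
            llm[name] = value
    return llm, vit


# Made-up tensor names standing in for the real checkpoint keys.
bundled = {
    "vit_model.blocks.0.attn.qkv.weight": "vit-tensor-a",
    "vit_model.patch_embed.proj.weight": "vit-tensor-b",
    "language_model.layers.0.mlp.weight": "llm-tensor",
}
llm, vit = split_bundled_vit(bundled)
print(sorted(vit))
print(sorted(llm))
```

If the 1021-tensor figure in the files table counts what remains after the split, the 1411-tensor source implies 390 ViT tensors moved to vit.safetensors.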

Limitations + caveats

Aesthetic is painterly by design. Lance_3B_Video was further fine-tuned on video data; the native style is intentionally painterly, not photorealistic. Lance_3B (image specialist) is the crystal-photo checkpoint.
Degraded regime at n_lat ≥ ~30k (see Known issue); n_lat ≈ 21k–26k is untested. Phase 5m fixed the silent quality regression at intermediate n_lat (verified through 16,128 with channel renorm).
No streaming or batched generation.
English + Chinese prompts. Other languages are out of distribution.

License

This MLX port: Apache 2.0.

Underlying weights:

Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
Wan2.2 VAE: Apache 2.0 (Alibaba).
Qwen2.5-VL: Apache 2.0 (Alibaba).

See NOTICE for attribution.

Citation

@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}

Model tree for mlx-community/Lance-3B-Video-bf16

Base model: Qwen/Qwen2.5-VL-3B-Instruct
Finetuned: bytedance-research/Lance
Finetuned: this model

