Part of the LongCat-Video - MLX collection.

LongCat-Video-bf16 (MLX)

Apple MLX bf16 weights for LongCat-Video, Meituan's 13.6 B-parameter base text/image-to-video diffusion model, with cfg_step_lora and refinement_lora published as separate files for runtime task switching.

The same DiT checkpoint serves all six task variants:

| Variant | Pipeline | LoRAs used |
|---|---|---|
| T2V (text-to-video) | pipeline_t2v | none (baseline) or cfg_step_lora (fast) |
| I2V (image-to-video) | pipeline_i2v | same |
| Video Continuation | pipeline_continuation | same |
| 720p / 30fps refinement | refinement.py | refinement_lora + Block Sparse Attention |
| Long-Video | (chained Continuation) | same as Continuation |
| Interactive Video | (per-segment T2V/Continuation) | same |

For the companion audio-driven Avatar 1.5 port (built from the same DiT architecture + audio cross-attention overlay), see mlx-community/LongCat-Video-Avatar-1.5-bf16.

TL;DR

| | |
|---|---|
| Architecture | Wan 2.1 VAE + umT5-XXL + 48-block base DiT + 2 LoRAs |
| Params | ~13.6 B DiT + ~11 B umT5 + 0.5 B VAE + 2 × ~0.6 B LoRA |
| Format | bf16, sharded safetensors (HF-style per-component subdirs) |
| Disk | ~42 GB total (26 GB DiT + 11 GB umT5 + 5.3 GB LoRAs + 242 MB VAE) |
| Hardware | Apple Silicon M-series, 64 GB+ unified memory recommended for 480p |
| Inference | 50-step baseline OR ~8-step with cfg_step_lora (fast); refinement adds 720p/30fps SDEdit pass |
| License | MIT (matches upstream Meituan) |
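
As a rough sanity check on the figures above: bf16 stores each parameter in 2 bytes, so the DiT's on-disk size follows from its parameter count. The sketch below uses only numbers quoted in this card; the helper name `bf16_gib` is illustrative and not part of the repo.

```python
# Rough footprint check using only the figures quoted in this card.
BYTES_PER_BF16 = 2  # bfloat16 is a 16-bit format


def bf16_gib(params: float) -> float:
    """Size in GiB of `params` parameters stored as bf16."""
    return params * BYTES_PER_BF16 / 2**30


# DiT parameter count as stated in the TL;DR.
dit_gib = bf16_gib(13.6e9)

# Per-component disk sizes (GB) as listed in the Disk row.
components_gb = {"dit": 26.0, "umt5": 11.0, "loras": 5.3, "vae": 0.242}
total_gb = sum(components_gb.values())

print(f"DiT at bf16: {dit_gib:.1f} GiB")
print(f"Sum of listed components: {total_gb:.1f} GB")
```

The DiT estimate lands close to the 26 GB listed, and the components sum to roughly the ~42 GB total. Runtime memory will be higher than disk size once activations are included.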

Quick start

# 1. Pull weights (~42 GB)
hf download mlx-community/LongCat-Video-bf16 \
    --local-dir ./weights

# 2. Set up inference (Python 3.12)
git clone https://github.com/xocialize/longcat-video-mlx
cd longcat-video-mlx
python3.12 -m venv .venv
.venv/bin/pip install -e ".[parity]"

# 3. Run text-to-video at 480p / 15fps
.venv/bin/python scripts/run_t2v.py \
    --weights ./weights/.. \
    --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
    --num-frames 93 \
    --out output_t2v.mp4

# 4. (Optional) Refinement pass to 720p / 30fps
.venv/bin/python scripts/run_refine.py \
    --weights ./weights/.. \
    --stage1 output_t2v.npy \
    --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
    --out output_refined.mp4
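
For planning clip length, the frame count in step 3 maps to duration and latent length roughly as follows. This is a minimal sketch that assumes the Wan 2.1 VAE's usual 4x temporal compression, with the first frame encoded on its own. The function names are illustrative and not part of the repo.

```python
def clip_seconds(num_frames: int, fps: int) -> float:
    """Playback duration of a clip in seconds."""
    return num_frames / fps


def latent_frames(num_frames: int, temporal_stride: int = 4) -> int:
    """Latent frame count, assuming a causal VAE that keeps the first frame
    and compresses every following group of `temporal_stride` frames into one
    latent frame."""
    if (num_frames - 1) % temporal_stride != 0:
        raise ValueError(
            f"num_frames must be of the form {temporal_stride}*k + 1, got {num_frames}"
        )
    return (num_frames - 1) // temporal_stride + 1


if __name__ == "__main__":
    # Values from step 3 of the quick start: 93 frames at 15 fps.
    print(clip_seconds(93, 15))
    print(latent_frames(93))
```

Under that assumption, frame counts of the form 4k + 1 (such as 93) divide cleanly into latent frames.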

Six task variants from one DiT

All six pipelines share the same 13.6 B DiT weights. The conditioning input and LoRA stack are what change:

| Variant | Conditioning latent | LoRA stack | BSA |
|---|---|---|---|
| T2V | pure noise | (optional cfg_step_lora) | off |
| I2V | 1 reference frame at head | (optional cfg_step_lora) | off |
| Continuation | last N frames of prior clip | (optional cfg_step_lora) | off |
| Refinement | partial-noise on VAE-encoded upsample of coarse output | refinement_lora | on |
| Long-Video | chained Continuation segments | inherits | off |
| Interactive | sequenced T2V/Continuation w/ per-segment prompts | inherits | off |
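
One way to encode the variant table in code is a small lookup that decides which LoRA to load and whether to enable BSA. This is a hypothetical sketch: `VariantSpec`, `VARIANTS`, and `loras_to_load` are illustrative names, not the repo's API. Long-Video and Interactive are omitted because they inherit their settings from the segments they chain.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariantSpec:
    conditioning: str       # what the conditioning latent contains
    lora: Optional[str]     # LoRA associated with the variant
    lora_optional: bool     # True if the LoRA is only used on the fast path
    bsa: bool               # whether Block Sparse Attention is enabled


# Rows transcribed from the variant table.
VARIANTS = {
    "t2v": VariantSpec("pure noise", "cfg_step_lora", True, False),
    "i2v": VariantSpec("1 reference frame at head", "cfg_step_lora", True, False),
    "continuation": VariantSpec(
        "last N frames of prior clip", "cfg_step_lora", True, False
    ),
    "refinement": VariantSpec(
        "partial-noise on VAE-encoded upsample of coarse output",
        "refinement_lora",
        False,
        True,
    ),
}


def loras_to_load(variant: str, fast: bool = False) -> list[str]:
    """Return the LoRA names a variant needs; optional LoRAs load only if fast."""
    spec = VARIANTS[variant]
    if spec.lora is None or (spec.lora_optional and not fast):
        return []
    return [spec.lora]
```

For example, `loras_to_load("t2v")` yields an empty list for the 50-step baseline, while `loras_to_load("t2v", fast=True)` selects cfg_step_lora.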

Architecture

This is the base text-to-video port. It differs from the Avatar overlay added by the companion repo as follows:

  • No audio path: no Whisper-Large-v3 encoder, no AudioProjModel, and no audio cross-attention in the DiT blocks
  • No Reference Skip Attention: base I2V uses the reference frame as a motion anchor, not a persistent identity, so the Avatar-specific Q-slicing is not used here
  • Standard 2-pass text CFG, versus Avatar's 3-pass disentangled CFG
  • scheduler_shift = 12.0, versus Avatar's 7.0
  • Block Sparse Attention: needed only by the 720p refinement pass (enable_bsa: false in the base DiT config; the refinement script turns it on and hot-swaps refinement_lora)
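
To make the CFG and scheduler_shift entries concrete, here is a minimal sketch of both operations. It assumes scheduler_shift is the usual flow-matching sigma shift (the form used by schedulers such as diffusers' FlowMatchEulerDiscreteScheduler) and that CFG is the standard two-pass combination. Whether this port applies exactly these formulas is an assumption, and the function names are illustrative.

```python
def shift_sigma(sigma: float, shift: float) -> float:
    """Flow-matching timestep shift.

    Maps sigma in [0, 1] to [0, 1]; a larger shift keeps more of the
    schedule at high noise levels.
    """
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def cfg_combine(cond: list[float], uncond: list[float], scale: float) -> list[float]:
    """Standard two-pass classifier-free guidance on flat prediction vectors."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Compare the base model's shift (12.0) with Avatar's (7.0) at mid-schedule.
    for shift in (12.0, 7.0):
        print(shift, round(shift_sigma(0.5, shift), 4))
```

With these definitions, a shift of 12.0 maps the mid-schedule sigma 0.5 to a higher noise level than a shift of 7.0 does, so more of the step budget is spent on coarse structure.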

Block Sparse Attention details

BSA params from the published config:

"bsa_params": {
  "sparsity": 0.9375,
  "chunk_3d_shape_q": [4, 4, 4],
  "chunk_3d_shape_k": [4, 4, 4]
}

Tokens are grouped into 4×4×4 = 64-token blocks along the patchified (T_lat, H_lat, W_lat) grid. A sparsity of 0.9375 means each Q block keeps 6.25% (1 in 16) of the K/V blocks, selected by top-k routing on block-level mean-pooled scores. This makes 720p attention tractable; without it the 720p second pass would be too expensive on Apple Silicon. (The Tier A pure-MLX implementation in this port is correct but not yet kernel-fast; a Tier B Metal kernel is in progress.)
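
The routing step can be sketched in plain Python. This is an illustrative, framework-free sketch rather than the port's implementation: it assumes the block score is the dot product of the mean-pooled Q and K vectors, and all function names are hypothetical.

```python
import math


def num_blocks(grid: tuple, chunk: tuple = (4, 4, 4)) -> int:
    """Number of blocks covering a (T_lat, H_lat, W_lat) token grid."""
    return math.prod(math.ceil(g / c) for g, c in zip(grid, chunk))


def blocks_kept(num_k_blocks: int, sparsity: float = 0.9375) -> int:
    """How many K/V blocks each Q block attends to (at least one)."""
    return max(1, int(round(num_k_blocks * (1.0 - sparsity))))


def mean_pool(tokens: list) -> list:
    """Average a block's token vectors into a single vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def select_kv_blocks(q_block: list, k_blocks: list, sparsity: float = 0.9375) -> list:
    """Indices of the top-k K/V blocks for one Q block, by pooled dot product."""
    q_mean = mean_pool(q_block)
    scored = []
    for idx, k_tokens in enumerate(k_blocks):
        k_mean = mean_pool(k_tokens)
        score = sum(a * b for a, b in zip(q_mean, k_mean))
        scored.append((score, idx))
    keep = blocks_kept(len(k_blocks), sparsity)
    top = sorted(scored, reverse=True)[:keep]
    return sorted(idx for _, idx in top)
```

Full attention is then computed only between each Q block and its selected K/V blocks, so the cost of the attention matmuls drops to roughly 1/16 of the dense case at this sparsity.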

Programmatic LoRA merge

Each LoRA can be loaded separately for fine-grained control:

from longcat_video.pipeline_t2v import LongCatVideoT2VPipeline, T2VPipelineConfig
from longcat_video.lora import compute_merged_delta, group_lora_tensors
from safetensors import safe_open
import mlx.core as mx

pipeline = LongCatVideoT2VPipeline(...)   # standard 3-component load

# Merge cfg_step_lora for the fast path (8 steps, no CFG correction)
lora_sd = {}
with safe_open("weights/lora/cfg_step_lora.safetensors", framework="numpy") as f:
    for k in f.keys():
        lora_sd[k] = mx.array(f.get_tensor(k))

# (LoRA merge helper covers both cfg_step_lora and refinement_lora;
# load whichever path your variant uses.)
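
For reference, merging a LoRA into a base weight generally means adding a scaled low-rank product to it. The sketch below shows that arithmetic on plain nested lists. It is not the repo's `compute_merged_delta`; the `alpha / rank` scaling and the A/B matrix layout are the common LoRA convention and are assumptions here.

```python
def matmul(a: list, b: list) -> list:
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merged_delta(lora_a: list, lora_b: list, alpha: float, rank: int) -> list:
    """delta_W = (alpha / rank) * B @ A.

    lora_a has shape (rank, in_features); lora_b has shape (out_features, rank).
    """
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(lora_b, lora_a)]


def merge(weight: list, delta: list) -> list:
    """Return weight + delta without modifying the inputs."""
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]


if __name__ == "__main__":
    base = [[1.0, 0.0], [0.0, 1.0]]   # toy 2x2 base weight
    a = [[1.0, 2.0]]                  # rank-1 down-projection
    b = [[3.0], [4.0]]                # rank-1 up-projection
    print(merge(base, merged_delta(a, b, alpha=2.0, rank=1)))
```

Because the merge is a plain addition, swapping LoRAs at runtime amounts to subtracting one delta and adding another, or keeping an unmerged copy of the base weights.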

License

MIT, matching the upstream LongCat-Video license. Use of the model implies compliance with the upstream responsible-use guidelines (no generation of harmful, defamatory, or non-consensual content).

Acknowledgements
