Part of the LongCat-Video - MLX collection.

LongCat-Video-q4 (MLX)

4-bit quantized variant of mlx-community/LongCat-Video-bf16. Same model, same six task variants (T2V / I2V / Continuation / Refinement / Long-Video / Interactive), same cfg_step_lora + refinement_lora files; only the DiT Linears are quantized to 4-bit via mlx.nn.quantize, for Macs with less RAM.

TL;DR

DiT: 4-bit quantized (group_size=64; skips final_layer.linear, embedders, and AdaLN)
DiT size: ~9 GB (2 shards; 2.85× smaller than bf16's 26 GB)
VAE / umT5 / LoRAs: bf16 (unchanged from the bf16 variant)
Total disk: ~25 GB (vs. 42 GB for bf16)
Min unified memory: ~32 GB recommended for 480p
Inference: 50-step baseline, or 8-step with cfg_step_lora (fast)
License: MIT
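
The size figures above can be sanity-checked with a little arithmetic. The sketch below assumes that each group of 64 weights stores one 16-bit scale and one 16-bit bias next to the 4-bit values (this per-group overhead is an assumption about the storage layout, not something the card states). Under that assumption the ideal compression of a fully quantized tensor is higher than the quoted 2.85×, which is consistent with some layers being kept at bf16.

```python
# Back-of-envelope check of the sizes quoted in the TL;DR.
# Assumption: every group of GROUP_SIZE weights carries one 16-bit scale
# and one 16-bit bias in addition to the 4-bit weights themselves.
BITS = 4
GROUP_SIZE = 64
SCALE_BIAS_BITS = 16 + 16  # assumed per-group overhead

# Effective storage cost of one quantized weight, in bits.
bits_per_weight = BITS + SCALE_BIAS_BITS / GROUP_SIZE

# Compression versus bf16 if every single weight were quantized.
ideal_ratio = 16 / bits_per_weight

# Figures taken from the card.
BF16_DIT_GB = 26.0
BF16_TOTAL_GB = 42.0
QUOTED_RATIO = 2.85

# DiT size implied by the quoted ratio, and the resulting total when
# the VAE, umT5, and LoRAs stay at bf16.
q4_dit_gb = BF16_DIT_GB / QUOTED_RATIO
q4_total_gb = BF16_TOTAL_GB - BF16_DIT_GB + q4_dit_gb

print(f"effective bits per weight: {bits_per_weight}")
print(f"ideal ratio if everything were quantized: {ideal_ratio:.2f}x")
print(f"implied q4 DiT size: {q4_dit_gb:.1f} GB")
print(f"implied q4 total disk: {q4_total_gb:.1f} GB")
```

The implied DiT and total sizes round to the ~9 GB and ~25 GB quoted above.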

Quantization details

  • Method: mlx.nn.quantize(bits=4, group_size=64), following the MLX-LM convention
  • What's quantized: every nn.Linear in the 48-block DiT EXCEPT the skip patterns below
  • Skip patterns (kept at bf16):
    • final_layer.linear - Meituan's documented skip
    • t_embedder. - TimestepEmbedder MLP (small and sensitive; its output feeds adaLN_modulation, so quantization error here would corrupt the modulation; see L42 in skill-lessons.md)
    • y_embedder. - CaptionEmbedder MLP (small and sensitive)
    • adaLN_modulation. - per-block AdaLN-Zero modulation (must stay floating-point; quantizing it causes a silent accumulation bug, L11)
  • What's NOT quantized: VAE, umT5, both LoRAs. They contribute little to total disk, and quantizing them would degrade output more than it would save space.
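
The skip rules above could be expressed as a predicate passed to mlx.nn.quantize through its class_predicate argument. This is a minimal sketch, not the longcat_video implementation: the substring matching rule, the helper names is_skipped and quantize_dit, and the example module paths are all assumptions made for illustration.

```python
# Patterns copied from the list above; matching by substring is an assumption.
SKIP_PATTERNS = (
    "final_layer.linear",
    "t_embedder.",
    "y_embedder.",
    "adaLN_modulation.",
)


def is_skipped(path: str) -> bool:
    """Return True if the module at `path` should stay at bf16.

    A trailing "." is appended so that patterns ending in a dot also match
    a module whose path ends exactly at that name.
    """
    probe = path + "."
    return any(pattern in probe for pattern in SKIP_PATTERNS)


def quantize_dit(model, bits: int = 4, group_size: int = 64):
    """Quantize every nn.Linear in `model` except the skipped ones (hypothetical helper)."""
    import mlx.nn as nn  # imported lazily so is_skipped works without MLX installed

    def predicate(path, module):
        # nn.quantize calls this with each module's dotted path and the module.
        return isinstance(module, nn.Linear) and not is_skipped(path)

    nn.quantize(model, group_size=group_size, bits=bits, class_predicate=predicate)
    return model


# Example paths are hypothetical; the real DiT module names may differ.
for example in ("blocks.0.attn.qkv", "blocks.0.adaLN_modulation.1", "t_embedder.mlp.0"):
    print(example, "-> keep bf16" if is_skipped(example) else "-> quantize")
```

Keeping the predicate as a pure function of the path makes the skip list easy to test without loading any weights.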

The runtime pipeline (longcat_video package) auto-detects the quantization block in dit/config.json and applies nn.quantize before load_weights. No user-facing API change vs. the bf16 variant.
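
The detect-then-quantize order described above might look roughly like the following. This is a sketch under stated assumptions rather than the package's actual loader: the "quantization" key with "bits" and "group_size" fields follows the MLX-LM config convention, and read_quantization, load_dit, and the *.safetensors shard layout are hypothetical.

```python
import json
from pathlib import Path


def read_quantization(config_path):
    """Return {"bits", "group_size"} from a config file, or None if absent.

    Assumes an MLX-LM style block: {"quantization": {"bits": 4, "group_size": 64}}.
    """
    config = json.loads(Path(config_path).read_text())
    block = config.get("quantization")
    if not block:
        return None
    return {"bits": int(block["bits"]), "group_size": int(block["group_size"])}


def load_dit(model, weights_dir, class_predicate=None):
    """Quantize `model` if the config asks for it, then load weights (hypothetical helper)."""
    import mlx.core as mx  # imported lazily so read_quantization works without MLX
    import mlx.nn as nn

    dit_dir = Path(weights_dir) / "dit"
    params = read_quantization(dit_dir / "config.json")
    if params is not None:
        # Quantize first so parameter names and shapes match the stored
        # quantized tensors; the same skip predicate used at conversion
        # time must be passed here.
        nn.quantize(model, class_predicate=class_predicate, **params)

    # Assumed layout: one or more .safetensors shards next to config.json.
    weights = {}
    for shard in sorted(dit_dir.glob("*.safetensors")):
        weights.update(mx.load(str(shard)))
    model.load_weights(list(weights.items()))
    return model
```

A bf16 checkpoint simply has no quantization block, so the same loader can handle both variants without a separate code path.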

Quick start

# 1. Pull weights (~25 GB)
hf download mlx-community/LongCat-Video-q4 --local-dir ./weights

# 2. Set up inference (Python 3.12)
git clone https://github.com/xocialize/longcat-video-mlx
cd longcat-video-mlx
python3.12 -m venv .venv
.venv/bin/pip install -e ".[parity]"

# 3. Run text-to-video: pass --variant q4
.venv/bin/python scripts/run_t2v.py \
    --weights ./weights/.. \
    --variant q4 \
    --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
    --num-frames 93 \
    --out output_t2v.mp4

# 4. Fast mode: --variant q4 --cfg-step-lora reduces 50 steps to 8
.venv/bin/python scripts/run_t2v.py \
    --weights ./weights/.. \
    --variant q4 --cfg-step-lora \
    --prompt "A cat surfing on a wave at sunset..." \
    --num-frames 93 \
    --out output_t2v_fast.mp4

Choosing between bf16, q4, q8

| Variant | Disk | Min RAM | Quality | Pick when |
|---|---|---|---|---|
| bf16 | 42 GB | 64 GB | reference | You want the best output and have the RAM headroom |
| q4 | 25 GB | 32 GB | minor degradation | RAM is tight; you'd rather have q4 than not run at all |
| q8 | 30 GB | 48 GB | very close to bf16 | Best of both: small disk savings, near-bf16 quality |

For batch generation or API serving, bf16 is the right choice, because quality regression compounds. For exploration or personal use on a 32–64 GB Mac, q4 is the sweet spot.
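
As a rough aid, the Min RAM column can be turned into a small lookup. The helper below is hypothetical and encodes only the memory thresholds from the table; it returns the highest-quality variant that fits and does not capture the workload advice above (for example, preferring q4 for casual use even when q8 would fit).

```python
# (variant, minimum unified memory in GB), ordered from highest quality down.
# Thresholds are taken from the Min RAM column of the table.
VARIANTS = [("bf16", 64), ("q8", 48), ("q4", 32)]


def pick_variant(ram_gb: float):
    """Return the highest-quality variant whose Min RAM fits, or None."""
    for name, min_ram_gb in VARIANTS:
        if ram_gb >= min_ram_gb:
            return name
    return None


for ram in (16, 32, 48, 64, 128):
    print(f"{ram} GB ->", pick_variant(ram))
```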

License

MIT, matching the upstream LongCat-Video license.
