license: mit
library_name: mlx
pipeline_tag: text-to-video
tags:
  - mlx
  - apple-silicon
  - video-generation
  - text-to-video
  - image-to-video
  - video-continuation
  - longcat
  - flow-matching
  - block-sparse-attention
base_model:
  - meituan-longcat/LongCat-Video
language:
  - en
  - zh

Part of the LongCat-Video — MLX collection.

LongCat-Video-bf16 (MLX)

Apple MLX bf16 weights for LongCat-Video — Meituan's 13.6 B-parameter base text/image-to-video diffusion model — with the cfg_step_lora and refinement_lora published as separate files for runtime task switching.

The same DiT checkpoint serves all six task variants:

| Variant | Pipeline | LoRAs used |
|---|---|---|
| T2V (text-to-video) | `pipeline_t2v` | none (baseline) or `cfg_step_lora` (fast) |
| I2V (image-to-video) | `pipeline_i2v` | same as T2V |
| Video Continuation | `pipeline_continuation` | same as T2V |
| 720p / 30fps refinement | `refinement.py` | `refinement_lora` + Block Sparse Attention |
| Long-Video | (chained Continuation) | same as Continuation |
| Interactive Video | (per-segment T2V/Continuation) | same as T2V/Continuation |

For the companion audio-driven Avatar 1.5 port (built from the same DiT architecture + audio cross-attention overlay), see mlx-community/LongCat-Video-Avatar-1.5-bf16.

TL;DR


| Item | Details |
|---|---|
| Architecture | Wan 2.1 VAE + umT5-XXL + 48-block base DiT + 2 LoRAs |
| Params | ~13.6 B DiT + ~11 B umT5 + 0.5 B VAE + 2 × ~0.6 B LoRA |
| Format | bf16, sharded safetensors (HF-style per-component subdirs) |
| Disk | ~42 GB total (26 GB DiT + 11 GB umT5 + 5.3 GB LoRAs + 242 MB VAE) |
| Hardware | Apple Silicon M-series, 64 GB+ unified memory recommended for 480p |
| Inference | 50-step baseline OR ~8-step with `cfg_step_lora` (fast); refinement adds 720p/30fps SDEdit pass |
| License | MIT (matches upstream Meituan) |
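
Before pulling the weights, it can help to confirm that the target volume has room for them. The sketch below sums the per-component disk sizes listed above and compares the total with the free space reported by `shutil.disk_usage`. The helper names and the 10% headroom factor are illustrative assumptions, not part of this repo.

```python
import shutil

# Per-component on-disk sizes in GB, copied from the "Disk" row above.
COMPONENT_GB = {"dit": 26.0, "umt5": 11.0, "loras": 5.3, "vae": 0.242}


def total_weights_gb(components=COMPONENT_GB):
    """Sum the component sizes (in GB)."""
    return sum(components.values())


def has_room_for_weights(path=".", headroom=1.10):
    """Return True if `path` has enough free space for the download.

    `headroom` is an assumed 10% margin for temporary files.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= total_weights_gb() * headroom


if __name__ == "__main__":
    print(f"weights total: {total_weights_gb():.1f} GB")
    print("enough free space:", has_room_for_weights("."))
```

The component sum comes to roughly 42.5 GB, which matches the rounded ~42 GB figure.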

Quick start

# 1. Pull weights (~42 GB)
hf download mlx-community/LongCat-Video-bf16 \
    --local-dir ./weights

# 2. Set up inference (Python 3.12)
git clone https://github.com/xocialize/longcat-video-mlx
cd longcat-video-mlx
python3.12 -m venv .venv
.venv/bin/pip install -e ".[parity]"

# 3. Run text-to-video at 480p / 15fps
.venv/bin/python scripts/run_t2v.py \
    --weights ./weights/.. \
    --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
    --num-frames 93 \
    --out output_t2v.mp4

# 4. (Optional) Refinement pass to 720p / 30fps
.venv/bin/python scripts/run_refine.py \
    --weights ./weights/.. \
    --stage1 output_t2v.npy \
    --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
    --out output_refined.mp4

Six task variants from one DiT

All six pipelines share the same 13.6 B DiT weights. The conditioning input and LoRA stack are what change:

| Variant | Conditioning latent | LoRA stack | BSA |
|---|---|---|---|
| T2V | pure noise | (optional `cfg_step_lora`) | off |
| I2V | 1 reference frame at head | (optional `cfg_step_lora`) | off |
| Continuation | last N frames of prior clip | (optional `cfg_step_lora`) | off |
| Refinement | partial-noise on VAE-encoded upsample of coarse output | `refinement_lora` | on |
| Long-Video | chained Continuation segments | inherits | off |
| Interactive | sequenced T2V/Continuation w/ per-segment prompts | inherits | off |
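
The table above can be read as a small lookup from variant name to conditioning, LoRA stack, and BSA flag. The following is a minimal sketch of that mapping in plain Python. `VariantSpec`, `variant_spec`, and the lowercase variant keys are hypothetical names used for illustration; the actual repo may organise this differently.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VariantSpec:
    conditioning: str          # what the conditioning latent contains
    loras: Tuple[str, ...]     # LoRAs applied on top of the shared DiT
    bsa: bool                  # whether Block Sparse Attention is enabled


def variant_spec(name: str, fast: bool = False) -> VariantSpec:
    """Return the (hypothetical) spec for one of the six variants.

    `fast=True` adds the optional cfg_step_lora where the table allows it.
    """
    optional = ("cfg_step_lora",) if fast else ()
    table = {
        "t2v": VariantSpec("pure noise", optional, False),
        "i2v": VariantSpec("1 reference frame at head", optional, False),
        "continuation": VariantSpec("last N frames of prior clip", optional, False),
        "refinement": VariantSpec(
            "partial noise on VAE-encoded upsample of coarse output",
            ("refinement_lora",),
            True,
        ),
        # Long-Video and Interactive inherit the LoRA stack of the
        # T2V/Continuation segments they are built from.
        "long_video": VariantSpec("chained Continuation segments", optional, False),
        "interactive": VariantSpec(
            "sequenced T2V/Continuation with per-segment prompts", optional, False
        ),
    }
    return table[name]


if __name__ == "__main__":
    for key in ("t2v", "refinement"):
        print(key, variant_spec(key, fast=True))
```

Only the refinement entry turns BSA on, which mirrors the last column of the table.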

Architecture

This is the base text-to-video port. It differs from the Avatar overlay in the companion repo as follows:

- No audio path: no Whisper-Large-v3 encoder, no AudioProjModel, and no audio cross-attention in the DiT blocks.
- No Reference Skip Attention: base I2V uses the reference frame as a motion anchor, not a persistent identity, so the Avatar-specific Q-slicing is not used here.
- Standard 2-pass text CFG, versus Avatar's 3-pass disentangled CFG.
- `scheduler_shift = 12.0`, versus Avatar's 7.0.
- Block Sparse Attention is needed only by the 720p refinement pass: the base DiT config sets `enable_bsa: false`, and the refinement script turns it on when it hot-swaps `refinement_lora`.
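
To make the CFG and scheduler-shift settings above concrete, here is a minimal sketch in plain Python. The 2-pass combination is the usual classifier-free guidance formula. The shift function assumes the common flow-matching form `sigma' = s * sigma / (1 + (s - 1) * sigma)`; whether this port applies exactly that form is an assumption, and the function names and guidance scale are illustrative only.

```python
def cfg_combine(cond, uncond, guidance_scale):
    """Standard 2-pass CFG: move from the unconditional prediction
    toward the text-conditioned one by `guidance_scale`."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def shift_sigma(sigma, shift):
    """Assumed flow-matching timestep shift (not confirmed for this port)."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_schedule(num_steps, shift):
    """Linear sigmas from 1.0 down to 1/num_steps, then shifted."""
    sigmas = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift_sigma(s, shift) for s in sigmas]


if __name__ == "__main__":
    # A larger shift keeps the schedule at high noise levels for longer.
    print("shift 12.0:", round(shift_sigma(0.5, 12.0), 4))
    print("shift  7.0:", round(shift_sigma(0.5, 7.0), 4))
    print("cfg:", cfg_combine([2.0], [1.0], guidance_scale=4.0))
```

Under this assumed form, a mid-schedule sigma of 0.5 maps to about 0.923 with shift 12.0 and to 0.875 with shift 7.0, so the base model spends more of its steps at high noise than Avatar does.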

Block Sparse Attention details

BSA params from the published config:

"bsa_params": {
  "sparsity": 0.9375,
  "chunk_3d_shape_q": [4, 4, 4],
  "chunk_3d_shape_k": [4, 4, 4]
}

Tokens are grouped into 4×4×4 = 64-token blocks along the patchified (T_lat, H_lat, W_lat) grid. A sparsity of 0.9375 keeps 6.25% (1 in 16) of the K/V blocks for each Q block, selected by top-k routing on block-level mean-pooled scores. This makes 720p attention tractable; without it the 720p second pass would be too expensive on Apple Silicon. (The Tier A pure-MLX implementation in this port is correct but not yet kernel-fast; a Tier B Metal kernel is in progress.)
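
The block arithmetic and routing described above can be sketched in plain Python. This is an illustration under stated assumptions, not the port's kernel: the 16×16×16 grid is hypothetical, the score is assumed to be a dot product between mean-pooled Q and K block vectors, and all function names are invented for the example.

```python
def block_count(grid, chunk):
    """Number of blocks when a (T, H, W) token grid is cut into chunks."""
    assert all(g % c == 0 for g, c in zip(grid, chunk)), "grid must tile evenly"
    t, h, w = (g // c for g, c in zip(grid, chunk))
    return t * h * w


def blocks_kept(num_kv_blocks, sparsity):
    """K/V blocks each Q block attends to; always keep at least one."""
    return max(1, round(num_kv_blocks * (1.0 - sparsity)))


def mean_pool(vectors):
    """Average a block's token vectors into one representative vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def route(q_block_means, k_block_means, keep):
    """For each Q block, pick the `keep` highest-scoring K/V block indices."""
    routes = []
    for q in q_block_means:
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_block_means]
        order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        routes.append(sorted(order[:keep]))
    return routes


if __name__ == "__main__":
    n_blocks = block_count((16, 16, 16), (4, 4, 4))   # hypothetical grid
    keep = blocks_kept(n_blocks, 0.9375)
    print(f"{n_blocks} blocks, each Q block attends to {keep} K/V blocks")
    q_means = [mean_pool([[1.0, 0.0], [1.0, 0.0]])]
    k_means = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
    print("selected K/V blocks:", route(q_means, k_means, keep=2))
```

With 64 blocks and sparsity 0.9375, each Q block attends to 4 K/V blocks, which is the 1-in-16 ratio stated above.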

Programmatic LoRA merge

Each LoRA can be loaded separately for fine-grained control:

import mlx.core as mx

from longcat_video.pipeline_t2v import LongCatVideoT2VPipeline, T2VPipelineConfig
from longcat_video.lora import compute_merged_delta, group_lora_tensors

pipeline = LongCatVideoT2VPipeline(...)   # standard 3-component load

# Load cfg_step_lora for the fast path (8 steps, no CFG correction).
# mx.load reads a safetensors file directly into a dict of MLX arrays,
# avoiding a NumPy round-trip (NumPy has no native bfloat16 dtype).
lora_sd = mx.load("weights/lora/cfg_step_lora.safetensors")

# The merge helpers imported above cover both cfg_step_lora and
# refinement_lora: load whichever file your variant uses.
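
For readers unfamiliar with what a LoRA merge does, the sketch below shows the standard formulation on tiny matrices in plain Python: the low-rank factors are multiplied, scaled by `alpha / rank`, and added to the base weight. It is not the implementation of `compute_merged_delta`; the function names, tensor layout, and scaling convention here are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(lora_a, lora_b, alpha):
    """Standard LoRA update: (alpha / rank) * B @ A.

    Assumed layout: lora_a is (rank, in_dim), lora_b is (out_dim, rank).
    """
    rank = len(lora_a)
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(lora_b, lora_a)]


def merge(weight, delta):
    """Add the LoRA delta to the base weight, element by element."""
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(weight, delta)]


if __name__ == "__main__":
    base = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # (out_dim=2, in_dim=3)
    a = [[1.0, 0.0, 2.0]]                        # rank 1
    b = [[1.0], [3.0]]
    print(merge(base, lora_delta(a, b, alpha=2.0)))
```

Because the delta is folded into the weights once, a merged model runs at the same cost as the base model, and switching tasks means re-merging from the unmodified base weights.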

License

MIT — matches the upstream LongCat-Video license. Use of the model implies compliance with the upstream's responsible-use guidelines (no generation of harmful, defamatory, or non-consensual content).

Acknowledgements

- Meituan LongCat team: original PyTorch model and tech report
- ml-explore/mlx: the framework
- mlx-community: collection home