	---
	license: mit
	library_name: mlx
	pipeline_tag: text-to-video
	tags:
	- mlx
	- apple-silicon
	- video-generation
	- text-to-video
	- image-to-video
	- video-continuation
	- longcat
	- flow-matching
	- block-sparse-attention
	base_model:
	- meituan-longcat/LongCat-Video
	language:
	- en
	- zh
	---

	Part of the [LongCat-Video — MLX](https://huggingface.co/collections/mlx-community/longcat-video-mlx) collection.


	# LongCat-Video-bf16 (MLX)

	Apple MLX bf16 weights for [LongCat-Video](https://github.com/meituan-longcat/LongCat-Video) —
	Meituan's 13.6 B-parameter base text/image-to-video diffusion model — with the
	`cfg_step_lora` and `refinement_lora` published as separate files for
	runtime task switching.

	The same DiT checkpoint serves all six task variants:

	| Variant | Pipeline | LoRAs used |
	|---|---|---|
	| T2V (text-to-video) | `pipeline_t2v` | none (baseline) or `cfg_step_lora` (fast) |
	| I2V (image-to-video) | `pipeline_i2v` | same as T2V |
	| Video Continuation | `pipeline_continuation` | same as T2V |
	| 720p / 30fps refinement | `refinement.py` | `refinement_lora` + Block Sparse Attention |
	| Long-Video | (chained Continuation) | same as Continuation |
	| Interactive Video | (per-segment T2V/Continuation) | same as T2V/Continuation |

	For the companion audio-driven Avatar 1.5 port (built from the same DiT
	architecture + audio cross-attention overlay), see
	[mlx-community/LongCat-Video-Avatar-1.5-bf16](https://huggingface.co/mlx-community/LongCat-Video-Avatar-1.5-bf16).

	## TL;DR

	| | |
	|---|---|
	| Architecture | Wan 2.1 VAE + umT5-XXL + 48-block base DiT + 2 LoRAs |
	| Params | ~13.6 B DiT + ~11 B umT5 + 0.5 B VAE + 2 × ~0.6 B LoRA |
	| Format | bf16, sharded safetensors (HF-style per-component subdirs) |
	| Disk | ~42 GB total (26 GB DiT + 11 GB umT5 + 5.3 GB LoRAs + 242 MB VAE) |
	| Hardware | Apple Silicon M-series, 64 GB+ unified memory recommended for 480p |
	| Inference | 50-step baseline or ~8-step with `cfg_step_lora` (fast); refinement adds a 720p/30fps SDEdit pass |
	| License | MIT (matches upstream Meituan) |
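
As a quick sanity check on the table, the sketch below sums the per-component disk sizes and compares how many DiT forward passes the two inference modes need. The pass counts assume that the 50-step baseline runs two DiT passes per step (conditional and unconditional, as in standard text CFG) and that the `cfg_step_lora` path runs one pass per step. That is an assumption drawn from the descriptions in this card, not a measurement of this port, and the helper names are hypothetical.

```python
# Disk sizes in GB, copied from the TL;DR table.
DISK_GB = {"dit": 26.0, "umt5": 11.0, "loras": 5.3, "vae": 0.242}


def total_disk_gb(sizes: dict) -> float:
    """Sum the per-component download sizes."""
    return sum(sizes.values())


def dit_forward_passes(steps: int, passes_per_step: int) -> int:
    """Total DiT evaluations for one sampling run."""
    return steps * passes_per_step


baseline = dit_forward_passes(steps=50, passes_per_step=2)  # assumed 2-pass CFG
fast = dit_forward_passes(steps=8, passes_per_step=1)       # assumed single pass

print(f"total disk: {total_disk_gb(DISK_GB):.1f} GB")
print(f"DiT passes: baseline={baseline}, fast={fast}")
```

Under these assumptions the fast path needs roughly an order of magnitude fewer DiT evaluations than the baseline, which is where most of the speedup would come from.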

	## Quick start

	```bash
	# 1. Pull weights (~42 GB)
	hf download mlx-community/LongCat-Video-bf16 \
	    --local-dir ./weights

	# 2. Set up inference (Python 3.12)
	git clone https://github.com/xocialize/longcat-video-mlx
	cd longcat-video-mlx
	python3.12 -m venv .venv
	.venv/bin/pip install -e ".[parity]"

	# 3. Run text-to-video at 480p / 15fps
	.venv/bin/python scripts/run_t2v.py \
	    --weights ./weights/.. \
	    --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
	    --num-frames 93 \
	    --out output_t2v.mp4

	# 4. (Optional) Refinement pass to 720p / 30fps
	.venv/bin/python scripts/run_refine.py \
	    --weights ./weights/.. \
	    --stage1 output_t2v.npy \
	    --prompt "A cat surfing on a wave at sunset, cinematic, 8k" \
	    --out output_refined.mp4
	```
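
The `--num-frames 93` value in step 3 corresponds to a clip of about 6.2 seconds at 15 fps. The sketch below does that arithmetic. It also assumes the 4x temporal compression usually associated with the Wan 2.1 VAE, with the first frame encoded on its own, to show why frame counts of the form 4k + 1 map to a whole number of latent frames. The compression rule is not stated in this card, and the helper names are hypothetical.

```python
def clip_seconds(num_frames: int, fps: float) -> float:
    """Playback duration of a clip."""
    return num_frames / fps


def latent_frames(num_frames: int, temporal_stride: int = 4) -> int:
    """Assumed causal-VAE rule: one latent frame for the first video frame,
    then one latent frame per `temporal_stride` further frames."""
    if (num_frames - 1) % temporal_stride != 0:
        raise ValueError(
            f"num_frames must be {temporal_stride}*k + 1, got {num_frames}"
        )
    return 1 + (num_frames - 1) // temporal_stride


print(f"{clip_seconds(93, 15):.1f} s")   # duration at 15 fps
print(latent_frames(93), "latent frames (under the assumed 4x stride)")
```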

	## Six task variants from one DiT

	All six pipelines share the same 13.6 B DiT weights. The conditioning input
	and LoRA stack are what change:

	| Variant | Conditioning latent | LoRA stack | BSA |
	|---|---|---|---|
	| T2V | pure noise | (optional `cfg_step_lora`) | off |
	| I2V | 1 reference frame at head | (optional `cfg_step_lora`) | off |
	| Continuation | last N frames of prior clip | (optional `cfg_step_lora`) | off |
	| Refinement | partial noise on VAE-encoded upsample of coarse output | `refinement_lora` | on |
	| Long-Video | chained Continuation segments | inherits from Continuation | off |
	| Interactive | sequenced T2V/Continuation with per-segment prompts | inherits from T2V/Continuation | off |
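
One way to carry this table around in code is a small lookup that returns the LoRA stack and BSA flag for a variant. This is an illustrative sketch only: `VariantSpec`, `VARIANTS`, and `lora_stack` are hypothetical names that do not come from the `longcat-video-mlx` repo, and "inherits" is interpreted here as "same LoRA and BSA settings as T2V and Continuation".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VariantSpec:
    conditioning: str
    fixed_lora: Optional[str]  # LoRA that is always applied, if any
    allows_fast_lora: bool     # whether cfg_step_lora may be added
    bsa: bool                  # Block Sparse Attention on/off


VARIANTS = {
    "t2v": VariantSpec("pure noise", None, True, False),
    "i2v": VariantSpec("1 reference frame at head", None, True, False),
    "continuation": VariantSpec("last N frames of prior clip", None, True, False),
    "refinement": VariantSpec(
        "partial noise on VAE-encoded upsample of coarse output",
        "refinement_lora", False, True,
    ),
    # Assumed to reuse the T2V/Continuation settings segment by segment.
    "long_video": VariantSpec("chained Continuation segments", None, True, False),
    "interactive": VariantSpec(
        "sequenced T2V/Continuation with per-segment prompts", None, True, False
    ),
}


def lora_stack(variant: str, fast: bool = False) -> List[str]:
    """Return the LoRA names to apply for a variant."""
    spec = VARIANTS[variant]
    stack = [spec.fixed_lora] if spec.fixed_lora else []
    if fast and spec.allows_fast_lora:
        stack.append("cfg_step_lora")
    return stack


print(lora_stack("t2v"), lora_stack("i2v", fast=True), lora_stack("refinement"))
```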

	## Architecture

	This is the base text-to-video port. It differs from the Avatar overlay in
	the companion repo as follows:

	- No audio path: no Whisper-Large-v3 encoder, no AudioProjModel, and no
	  audio cross-attention in the DiT blocks.
	- No Reference Skip Attention: base I2V uses the reference frame as a
	  motion anchor, not a persistent identity, so the Avatar-specific
	  Q-slicing is not used here.
	- Standard 2-pass text CFG, versus Avatar's 3-pass disentangled CFG.
	- `scheduler_shift = 12.0`, versus 7.0 in Avatar.
	- Block Sparse Attention is needed only by the 720p refinement pass.
	  `enable_bsa` is `false` in the base DiT config; the refinement script
	  turns it on and hot-swaps in `refinement_lora`.
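
The two sampler-level settings in this list can be illustrated with a short sketch. It assumes the timestep shift commonly used by flow-matching samplers, `sigma' = s * sigma / (1 + (s - 1) * sigma)`, and the usual two-pass classifier-free guidance combination. Whether this port's scheduler uses exactly these formulas is an assumption, and the function names and guidance scale are hypothetical.

```python
def shift_sigma(sigma: float, shift: float) -> float:
    """Commonly used flow-matching timestep shift (assumed, not confirmed here)."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def cfg_combine(cond, uncond, scale):
    """Two-pass CFG: move from the unconditional toward the conditional prediction."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Compare the base shift (12.0) with the Avatar shift (7.0) at a few noise levels.
for sigma in (0.25, 0.5, 0.75):
    base = shift_sigma(sigma, 12.0)
    avatar = shift_sigma(sigma, 7.0)
    print(f"sigma={sigma:.2f}  shift=12 -> {base:.4f}  shift=7 -> {avatar:.4f}")

# Toy 2-element "predictions" with a hypothetical guidance scale of 4.0.
print(cfg_combine([1.0, 2.0], [0.0, 1.0], 4.0))
```

Under this formula a larger shift maps the same nominal sigma to a higher noise level, so a shift of 12.0 spends more of the schedule at high noise than a shift of 7.0.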

	### Block Sparse Attention details

	BSA params from the published config:

	```json
	"bsa_params": {
	  "sparsity": 0.9375,
	  "chunk_3d_shape_q": [4, 4, 4],
	  "chunk_3d_shape_k": [4, 4, 4]
	}
	```

	Tokens are grouped into 4×4×4 = 64-token blocks along the patchified
	(T_lat, H_lat, W_lat) grid. A sparsity of 0.9375 keeps 6.25% (1 in 16) of
	the K/V blocks for each Q block, selected by top-k routing on block-level
	mean-pooled scores. This is what makes 720p attention tractable: without
	it, the 720p second pass would be too expensive on Apple Silicon. (The
	Tier A pure-MLX implementation in this port is correct but not yet
	kernel-fast; a Tier B Metal kernel is in progress.)
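
The routing step described above can be sketched in plain Python. This is a reference-style illustration, not the port's MLX code. It assumes the block score is the dot product of the mean-pooled Q and K vectors, that the number of kept blocks is rounded with a floor of one, and that the latent grid in the example is hypothetical. All function names are made up for this sketch.

```python
import math


def num_blocks(grid, chunk=(4, 4, 4)):
    """Count the 3D blocks covering a (T_lat, H_lat, W_lat) token grid."""
    return math.prod(math.ceil(g / c) for g, c in zip(grid, chunk))


def kept_kv_blocks(n_kv_blocks, sparsity=0.9375):
    """K/V blocks kept per Q block; rounding and the floor of 1 are assumptions."""
    return max(1, round(n_kv_blocks * (1.0 - sparsity)))


def mean_pool(block):
    """Average the token vectors of one block into a single vector."""
    dim = len(block[0])
    return [sum(tok[d] for tok in block) / len(block) for d in range(dim)]


def route_blocks(q_blocks, k_blocks, sparsity=0.9375):
    """For each Q block, return the indices of the top-k K blocks by pooled score."""
    k_means = [mean_pool(b) for b in k_blocks]
    keep = kept_kv_blocks(len(k_blocks), sparsity)
    routing = []
    for q_block in q_blocks:
        q_mean = mean_pool(q_block)
        scores = [sum(a * b for a, b in zip(q_mean, k_mean)) for k_mean in k_means]
        ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        routing.append(sorted(ranked[:keep]))
    return routing


# Hypothetical latent grid, chosen only to make the arithmetic concrete.
n = num_blocks((24, 32, 32))
print(n, "blocks;", kept_kv_blocks(n), "K/V blocks kept per Q block")
```

Dense attention is then computed only between each Q block and its selected K/V blocks, so the attention cost scales with the kept fraction instead of the full block count.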

	## Programmatic LoRA merge

	Each LoRA can be loaded separately for fine-grained control:

	```python
	from longcat_video.pipeline_t2v import LongCatVideoT2VPipeline, T2VPipelineConfig
	from longcat_video.lora import compute_merged_delta, group_lora_tensors
	from safetensors import safe_open
	import mlx.core as mx

	pipeline = LongCatVideoT2VPipeline(...)  # standard 3-component load

	# Load the cfg_step_lora tensors for the fast path (8 steps, no CFG correction).
	# Note: NumPy has no native bf16 dtype, so framework="numpy" may fail if the
	# LoRA file is stored in bf16; mx.load(path) returns a {name: mx.array} dict
	# directly in that case.
	lora_sd = {}
	with safe_open("weights/lora/cfg_step_lora.safetensors", framework="numpy") as f:
	    for k in f.keys():
	        lora_sd[k] = mx.array(f.get_tensor(k))

	# (The LoRA merge helper covers both cfg_step_lora and refinement_lora;
	# load whichever file your variant uses.)
	```
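
For intuition about what merging a LoRA into a base weight involves, the sketch below applies the standard LoRA update `W' = W + (alpha / r) * B @ A` to plain Python lists. It is not the `longcat_video.lora` helper, whose signature is not shown in this card. The function names, the `alpha / rank` scaling convention, and the toy matrices are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    Shapes: weight (out, in), lora_a (rank, in), lora_b (out, rank).
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (out, rank) @ (rank, in) -> (out, in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy rank-1 example on a 2x2 identity weight.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]      # (rank=1, in=2)
lora_b = [[0.5], [1.0]]    # (out=2, rank=1)
print(merge_lora(weight, lora_a, lora_b, alpha=2.0, rank=1))
```

Because the merged weight has the same shape as the original, switching tasks at runtime amounts to adding or subtracting a delta of this form for each affected layer.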

	## License

	MIT — matches the upstream [LongCat-Video](https://github.com/meituan-longcat/LongCat-Video)
	license. Use of the model implies compliance with the upstream's responsible-use
	guidelines (no generation of harmful, defamatory, or non-consensual content).

	## Acknowledgements

	- [Meituan LongCat team](https://github.com/meituan-longcat): original
	  PyTorch model and tech report
	- [ml-explore/mlx](https://github.com/ml-explore/mlx): the framework
	- [mlx-community](https://huggingface.co/mlx-community): collection home