Loading + batched generation on mlx_vlm (3 fixes)

by RockTalk - opened 6 days ago

Stock mlx_vlm (incl. the in-flight MiniMax-M3 support PR) can't serve this 4-bit quant out of the box:

Loading fails with "Received 855 parameters not in model" (the MoE experts ship pre-stacked as block_sparse_moe.switch_mlp.* plus separate shared_experts.*, which the stock sanitizer does not fuse), and
once loaded, concurrent / best-of-N requests hit "MiniMaxM3KVCache does not yet support batching".
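
For anyone hitting the same load error, a quick way to see which tensors are being rejected is to diff the checkpoint's key names against the names the model defines. The following is a minimal sketch in plain Python that does not call mlx_vlm. The helper names and the full key paths in the example are made up for illustration; only the switch_mlp and shared_experts components come from the error described above.

```python
def unexpected_keys(checkpoint_keys, model_keys):
    """Return checkpoint keys that the model does not define, sorted."""
    return sorted(set(checkpoint_keys) - set(model_keys))


def group_by_component(keys, components=("switch_mlp", "shared_experts")):
    """Bucket keys by the first listed component found in their dotted path."""
    groups = {name: [] for name in components}
    groups["other"] = []
    for key in keys:
        parts = key.split(".")
        for name in components:
            if name in parts:
                groups[name].append(key)
                break
        else:
            # No listed component appears in this key's path.
            groups["other"].append(key)
    return groups


# Hypothetical key names, for illustration only.
checkpoint = [
    "model.layers.0.block_sparse_moe.switch_mlp.gate_proj.weight",
    "model.layers.0.block_sparse_moe.shared_experts.up_proj.weight",
    "model.embed_tokens.weight",
]
model = ["model.embed_tokens.weight"]

extra = unexpected_keys(checkpoint, model)
for component, keys in group_by_component(extra).items():
    print(component, len(keys))
```

If most of the rejected keys land in the switch_mlp and shared_experts buckets, that points at the weight-renaming step rather than at a corrupted download.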

Three small fixes resolve both (load + batching). Submitted upstream here: https://github.com/ivanfioravanti/mlx-vlm/pull/2

Verified on an M3 Ultra (512GB): loads, generates ~23.4 tok/s, and a best-of-8 + unit-test coding bake-off scores 6/6 with batching on (CONC=4), 0 errors. Posting so others running this on Apple Silicon can find the fix.
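
For readers who want to run a similar check, the shape of a best-of-N run with a concurrency cap can be sketched as below. This is not the harness used for the numbers above. `generate` and `passes_tests` are placeholders for whatever client call and unit-test runner you use, and `fake_generate` is a stand-in so the sketch runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor


def best_of_n(generate, passes_tests, prompt, n=8, max_workers=4):
    """Sample n candidates with at most max_workers requests in flight.

    Returns the first candidate (in seed order) that passes, or None.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with seeds 0..n-1.
        candidates = list(pool.map(lambda seed: generate(prompt, seed), range(n)))
    for candidate in candidates:
        if passes_tests(candidate):
            return candidate
    return None


def fake_generate(prompt, seed):
    """Stand-in for a real request to the model server (assumption)."""
    return f"{prompt}-{seed}"


winner = best_of_n(fake_generate, lambda text: text.endswith("-3"), "task")
print(winner)
```

With a real server behind `generate`, several requests are in flight at once, which is the situation that triggered the batching error on the stock build.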

— Rocktalk Holdings

ivanfioravanti

MLX Community org about 4 hours ago

Please try again now. I re-uploaded it and it should be fixed.
