Loading + batched generation on mlx_vlm (3 fixes)

#1
by RockTalk - opened

Stock mlx_vlm (including the in-flight MiniMax-M3 support PR) can't serve this 4-bit quant out of the box:

  • loading fails with "Received 855 parameters not in model" (the MoE experts ship pre-stacked as block_sparse_moe.switch_mlp.* plus separate shared_experts.*, which the sanitizer didn't fuse), and
  • once loaded, concurrent / best-of-N requests fail with "MiniMaxM3KVCache does not yet support batching".

Three small fixes resolve both (load + batching). Submitted upstream here: https://github.com/ivanfioravanti/mlx-vlm/pull/2
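
For anyone hitting the same load error on another quant, a small diagnostic can show which groups of checkpoint keys the model does not define. This is only a sketch in plain Python, not part of the PR: the helper name `summarize_unexpected` and the example key names (including the `model.layers.N.` prefix and the `gate_proj` / `up_proj` suffixes) are assumptions for illustration; only the `block_sparse_moe.switch_mlp.*` and `shared_experts.*` fragments come from the error described above.

```python
import re
from collections import Counter


def summarize_unexpected(checkpoint_keys, model_keys):
    """Count checkpoint keys the model does not define, with layer indices collapsed."""
    expected = set(model_keys)
    extra = [k for k in checkpoint_keys if k not in expected]
    # Replace purely numeric path components (e.g. layer indices) with '*'
    # so that the same key repeated across layers is counted as one group.
    collapsed = (re.sub(r"(?<=\.)\d+(?=\.)", "*", k) for k in extra)
    return Counter(collapsed)


# Hypothetical key names, for illustration only.
checkpoint_keys = [
    "model.layers.0.block_sparse_moe.switch_mlp.gate_proj.weight",
    "model.layers.1.block_sparse_moe.switch_mlp.gate_proj.weight",
    "model.layers.0.block_sparse_moe.shared_experts.up_proj.weight",
    "model.layers.0.self_attn.q_proj.weight",
]
model_keys = ["model.layers.0.self_attn.q_proj.weight"]

summary = summarize_unexpected(checkpoint_keys, model_keys)
for pattern, count in summary.most_common():
    print(f"{count:5d}  {pattern}")
```

Feeding it the key names from the checkpoint and from the instantiated model should make it obvious whether the unmatched parameters all sit under the MoE blocks, which is the symptom described above.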

Verified on an M3 Ultra (512GB): the model loads, generates at ~23.4 tok/s, and scores 6/6 on a best-of-8 + unit-test coding bake-off with batching on (CONC=4) and 0 errors. Posting so others running this on Apple Silicon can find the fix.
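
The bake-off harness itself isn't included here, but the general shape of a best-of-N run with a concurrency cap can be sketched as follows. Everything in this block is an assumption for illustration: `best_of_n`, `generate`, and `score` are hypothetical names, and the stub `generate` stands in for whatever client call sends a request to the server.

```python
from concurrent.futures import ThreadPoolExecutor


def best_of_n(generate, score, prompt, n=8, conc=4):
    """Request n candidates with at most conc in flight; return (best, errors)."""

    def attempt(i):
        # Capture failures per request so one error does not abort the run.
        try:
            return generate(prompt, i), None
        except Exception as exc:
            return None, exc

    with ThreadPoolExecutor(max_workers=conc) as pool:
        results = list(pool.map(attempt, range(n)))

    candidates = [text for text, err in results if err is None]
    errors = [err for _, err in results if err is not None]
    best = max(candidates, key=score) if candidates else None
    return best, errors


# Stub generator and scorer, for illustration only. A real run would call the
# server in `generate` and run the unit tests in `score`.
def fake_generate(prompt, seed):
    return f"{prompt}-{seed}"


def fake_score(text):
    return int(text.rsplit("-", 1)[1])


best, errors = best_of_n(fake_generate, fake_score, "p", n=8, conc=4)
print(best, len(errors))
```

With `conc` greater than 1 the server sees overlapping requests, which is the situation that triggered the batching error before the fix.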

— Rocktalk Holdings

MLX Community org

Please try again now. I re-uploaded it, and it should be fixed.
