Routed expert tensors are half-width — model fails to load (likely a botched export)

#5
by sakamakismile - opened

Sharing this after trying to run the model end-to-end, in case it saves someone else the time — and happy to be corrected if I've missed something.

TL;DR — As uploaded, the routed-expert FFN tensors are half their required input width, so they don't form a dimensionally-valid DeepSeek-V4 expert and the model won't load. This also lines up with the 149B-vs-"284B" gap between the weights and the card/citation.

Findings

  1. All weights are bf16. The quantization_config: {quant_method: fp8} in config.json looks stale/inherited — there are no weight_scale_inv (or any fp8/fp4 scale) tensors in the repo.

  2. Per-expert shapes here are w1 [2048, 2048], w3 [2048, 2048], w2 [4096, 1024].

  3. For reference, deepseek-ai/DeepSeek-V4-Flash stores its NVFP4 experts as w1.weight_packed: uint8 [2048, 2048], which unpacks to [2048, 4096] (NVFP4 packs 2 values per byte). So the bf16 experts here carry exactly the packed dimensions — they look like the NVFP4 packed tensors written out as bf16 without being unpacked, leaving the hidden-side input halved (2048 instead of 4096).

  4. Consequence: the gate and up projections (w1, w3) take a 2048-dim input while hidden_size = 4096, so the expert block isn't dimensionally consistent. A DeepSeek-V4-aware llama.cpp build rejects it at load:

check_tensor_dims: tensor 'blk.0.ffn_gate_exps.weight' has wrong shape;
expected 4096, 2048, 256, got 2048, 2048, 256

  5. The parameter count corroborates it: full-width experts ≈ 284B (matching the citation title "…284B Sparse MoE…"), half-width ≈ 149B (matching this repo's reported 149.2B).
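To make the arithmetic in findings 3 to 5 concrete, here is a minimal sketch in plain Python. The helper names are hypothetical, the low-nibble-first order in `unpack_nibbles` is an assumption (the real NVFP4 layout, value table and block scales are not modelled), and the weights are assumed to use the usual [out_features, in_features] layout. The parameter split also assumes that the only difference between the two totals is the routed experts being exactly halved.

```python
# Sketch only: shape arithmetic for 4-bit packing (2 values per byte).
# All helper names are hypothetical; the shapes are the ones quoted in this post.

def unpacked_shape(packed_shape, values_per_byte=2):
    """Shape of a tensor after unpacking along its last axis."""
    *lead, last = packed_shape
    return (*lead, last * values_per_byte)


def unpack_nibbles(packed):
    """Split each byte into two 4-bit codes.

    Assumption: low nibble first. Mapping codes to real values and
    applying block scales is out of scope for this sketch.
    """
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


def split_params(full_total, halved_total):
    """Estimate (expert, non_expert) parameters, assuming the smaller total
    differs from the larger one only by the experts being exactly halved."""
    expert = 2 * (full_total - halved_total)
    return expert, full_total - expert


HIDDEN_SIZE = 4096  # hidden_size as quoted from config.json

# Per-expert shapes as uploaded, assumed [out_features, in_features].
repo_shapes = {"w1": (2048, 2048), "w3": (2048, 2048), "w2": (4096, 1024)}
full_shapes = {name: unpacked_shape(shape) for name, shape in repo_shapes.items()}

for name in ("w1", "w3"):
    uploaded_in = repo_shapes[name][1]
    unpacked_in = full_shapes[name][1]
    print(name, "uploaded input width:", uploaded_in, "ok:", uploaded_in == HIDDEN_SIZE)
    print(name, "unpacked input width:", unpacked_in, "ok:", unpacked_in == HIDDEN_SIZE)

expert_b, other_b = split_params(284.0, 149.2)
print("implied expert params (B):", round(expert_b, 1))
print("implied non-expert params (B):", round(other_b, 1))
```

Under those assumptions the two totals imply that the routed experts account for the large majority of the full-width parameter count, which is what you would expect for a sparse MoE.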

Also for anyone trying: no tokenizer/modeling files are bundled, so those have to come from the base deepseek-ai/DeepSeek-V4-Flash.
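If you want to check findings 1 and 2 yourself without loading any weights, the safetensors header can be read with the standard library alone: a file starts with an 8-byte little-endian length, followed by that many bytes of JSON mapping tensor names to dtype, shape and data offsets. The sketch below is a rough illustration, not the tooling I used. The function names and the tensor name in the synthetic demo are hypothetical, and the substring match on "scale" is just a heuristic for spotting tensors like weight_scale_inv.

```python
# Sketch only: inspect a safetensors header without loading tensor data.
import io
import json
import struct


def parse_safetensors_header(fileobj):
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}} from a binary file object."""
    (header_len,) = struct.unpack("<Q", fileobj.read(8))  # u64, little-endian
    header = json.loads(fileobj.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def summarize(header):
    """Collect the dtypes in use and any tensor whose name mentions a scale."""
    dtypes = sorted({info["dtype"] for info in header.values()})
    scale_names = sorted(name for name in header if "scale" in name)
    return dtypes, scale_names


# Synthetic header for demonstration: hypothetical tensor name, no tensor data.
demo = {
    "__metadata__": {"format": "pt"},
    "experts.0.w1.weight": {"dtype": "BF16", "shape": [2048, 2048], "data_offsets": [0, 0]},
}
payload = json.dumps(demo).encode("utf-8")
demo_file = io.BytesIO(struct.pack("<Q", len(payload)) + payload)

demo_header = parse_safetensors_header(demo_file)
demo_dtypes, demo_scales = summarize(demo_header)
print("dtypes:", demo_dtypes)
print("scale tensors:", demo_scales)
for name, info in demo_header.items():
    print(name, info["dtype"], info["shape"])
```

For a real shard, pass `open(path, "rb")` instead of the in-memory demo file. An fp8 or fp4 checkpoint would be expected to show a quantized dtype and a non-empty list of scale tensors.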

A re-exported, full-width (unpacked) version would likely load fine. Thanks for putting the work out either way.

Chunjiang Intelligence org

Feedback has been received. Thanks!
