Any chance of safetensors format?
Thank you for giving us a MagicQuant of Seed!
Is there any chance we can get the weights in safetensors format for better compatibility with vLLM? I want to run this with tensor parallelism across two GPUs for better speed, but vLLM's TP doesn't support the seed_oss GGUF architecture. llama.cpp's --split-mode row helps a little, but it still leaves a lot of performance on the table.
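For reference, the row-split run I'm comparing against looks roughly like this (the model filename and tensor-split ratio are just placeholders for my setup):

```shell
# llama.cpp row split across two GPUs (placeholder model path).
# --split-mode row splits individual tensors by rows across GPUs,
# instead of assigning whole layers per GPU (the default "layer" mode).
llama-server \
  -m seed-oss-magicquant.gguf \
  --n-gpu-layers 99 \
  --split-mode row \
  --tensor-split 1,1
```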
Sadly, MagicQuant tactics and vLLM do not mix at all, and it pains me too! But this isn't a MagicQuant-vs-vLLM issue; it's the nature of how GGUF and vLLM are designed.
vLLM is fundamentally built around uniform tensor layouts; that's what lets it do true tensor parallelism. MagicQuant, via GGUF, deliberately uses hybrid per-tensor quantization, which breaks those assumptions.
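To make the mismatch concrete, here's a toy sketch (not vLLM's actual code; the tensor names, shapes, and sharding rule are illustrative assumptions) of why a tensor-parallel splitter that assumes one quant type everywhere can't handle a hybrid GGUF. The block sizes reflect GGUF's real geometry (K-quants use 256-element super-blocks, Q8_0 uses 32-element blocks), but the checkpoint contents are made up:

```python
# Toy model of TP sharding vs quantization layouts (illustrative only).

# In a uniform checkpoint, every weight uses the same quant type,
# so one sharding rule applies to every tensor.
uniform = {f"blk.{i}.ffn_up.weight": "Q4_K" for i in range(4)}

# A MagicQuant-style hybrid checkpoint picks a quant type per tensor.
hybrid = {
    "blk.0.ffn_up.weight": "Q4_K",
    "blk.1.ffn_up.weight": "Q6_K",
    "blk.2.ffn_up.weight": "Q8_0",
    "blk.3.ffn_up.weight": "Q4_K",
}

# Elements per quant block in GGUF: K-quants use 256-element
# super-blocks, Q8_0 uses 32-element blocks.
BLOCK_ELEMS = {"Q4_K": 256, "Q6_K": 256, "Q8_0": 32}

def tp_shardable(tensors, row_len=4096, tp=2):
    """A TP engine that assumes one uniform layout can only split cleanly
    if every tensor shares the same quant block geometry AND the shard
    boundary lands on a block boundary."""
    geometries = {BLOCK_ELEMS[q] for q in tensors.values()}
    if len(geometries) != 1:
        return False  # mixed block geometries: no single sharding rule works
    (block,) = geometries
    return (row_len // tp) % block == 0  # shards must align to quant blocks

print(tp_shardable(uniform))  # True: one quant type, shards align
print(tp_shardable(hybrid))   # False: mixed types defeat a uniform split
```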
That said, if someone ever figures out how to bring hybrid per-tensor logic into a vLLM-native format, that would be an incredible day.