does not appear to have a file named configuration_glm4_moe.py

#7
by hi-im-bored - opened

Error w/ both SGLang & vLLM

OSError: Firworks/INTELLECT-3-nvfp4 does not appear to have a file named configuration_glm4_moe.py. Checkout 'https://huggingface.co/Firworks/INTELLECT-3-nvfp4/tree/main' for available files.

nvm this is upstream as well

hi-im-bored changed discussion status to closed

Did you try running with `-e VLLM_USE_FLASHINFER_MOE_FP4=1`? I've got a sample vLLM command on the model card that works. I tested this one on an RTX Pro 6000 Blackwell and it ran.
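For reference, a launch along these lines is what the flag above fits into. This is a hedged sketch, not the exact command from the model card: the env flag comes from this thread, while the port mapping and other arguments are illustrative defaults.

```shell
# Illustrative vLLM launch for the NVFP4 quant on Blackwell hardware.
# VLLM_USE_FLASHINFER_MOE_FP4=1 is the flag mentioned in this thread;
# everything else here is an assumed, typical docker invocation.
docker run --gpus all -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  vllm/vllm-openai:latest \
  --model Firworks/INTELLECT-3-nvfp4
```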

Yeah, I still had the issue; I just needed to apply https://huggingface.co/PrimeIntellect/INTELLECT-3/discussions/1, and after that it resolved the architecture fine automatically.
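If the linked fix amounts to removing the reference to the missing custom-code file (an assumption; I haven't verified the exact diff in that discussion), the edit to a local snapshot of the repo would look roughly like this. The `auto_map` key and the class names shown are hypothetical examples of what such a reference looks like:

```python
import json
import os
import tempfile

# Hedged sketch: drop the "auto_map" entry pointing at the missing
# configuration_glm4_moe.py so transformers falls back to its built-in
# glm4_moe classes instead of trying to fetch the absent remote file.
# The dict below stands in for the model's config.json.
cfg = {
    "architectures": ["Glm4MoeForCausalLM"],
    "model_type": "glm4_moe",
    "auto_map": {"AutoConfig": "configuration_glm4_moe.Glm4MoeConfig"},
}

cfg.pop("auto_map", None)  # remove the reference to the custom code file

# Write the cleaned config back out (here to a temp dir for illustration).
path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
```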

On a DGX Spark w/ vllm/vllm-openai:latest there were some other issues, e.g. it not using the TRTLLM backend, but again that's not the model's fault, just quirks of new hardware I guess.

Ah thanks for the info. I just removed it. I checked and the base repo also has it removed. As for the Spark, I wish I had one to test these quants on. It was a big inspiration for me trying to make all these quants so I hope at least some of them work on there.

Sorry to pile more work on you but by any chance would you also be interested/willing to do speculative decoding for the model as well? https://nvidia.github.io/Model-Optimizer/guides/5_speculative_decoding.html

The 4.5-Air FP8 release supports it, but it seems it isn't quite compatible with the PI release.

Seems interesting. I spent about 6 hours trying to make an EAGLE version of INTELLECT-3. I think I'm maybe close-ish to cracking it, but I have to stop for a bit. I'll keep messing with it and see if I can make one. As far as I can tell, once EAGLE weights for INTELLECT-3 exist, they should be usable with either my NVFP4 quant or the full-fat version and speed both up.
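Once such draft weights existed, hooking them up in vLLM would look roughly like this. A purely illustrative sketch: the draft-model repo name is a placeholder, and the token count is an arbitrary example, not a tuned value.

```shell
# Hypothetical: serve the NVFP4 quant with EAGLE speculative decoding
# via vLLM's --speculative-config. "<eagle-draft-repo>" is a placeholder
# for draft weights that do not exist yet.
vllm serve Firworks/INTELLECT-3-nvfp4 \
  --speculative-config '{"method": "eagle", "model": "<eagle-draft-repo>", "num_speculative_tokens": 3}'
```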
