Thanks!
Superficial testing (Python/JavaScript codegen and software-engineering tasks) doesn't show any performance degradation compared to https://chat.z.ai/. It's a great quantization for 96 GB of VRAM!
Thanks, great results on a Blackwell 96 GB GPU, getting an average of 80-90 t/s with 128k context size. Finally, Sonnet at home!
Echoing the thanks. This model and quant are great. Any chance you might also do the 4.5V model that just released?
Absolutely
We are working on it. Stay tuned!
I have been able to run this model with 128k context using vLLM on 4x RTX 3090. Thank you very much!
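For reference, a minimal sketch of what that launch looks like with vLLM's Python API, assuming a 4-way tensor-parallel split and a 128k window (the model path is a placeholder, and exact memory/quantization flags may differ for this checkpoint):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/this-quantized-checkpoint",  # placeholder: point at the downloaded quant
    tensor_parallel_size=4,       # one shard per RTX 3090
    max_model_len=131072,         # 128k context window
    gpu_memory_utilization=0.95,  # leave a little headroom on each card
)

# Quick smoke test
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```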
@hareram241 I just tested loading part of a codebase in an LLM client, almost 100k of context, and got this output in the vLLM logs:
Avg prompt throughput: 9528.4 tokens/s, Avg generation throughput: 22.6 tokens/s
Analyzing the input files took a while, and then the response came in at about half the usual tokens/s.
Is that enough info? Should I test in a different or better way?
Which version of vLLM should be used with this quantized model in order for it to run properly? I'm using vLLM 0.11, but I'm getting a `KeyError: layers.1.mlp.experts.w2_weight`. I've checked each of the weight files one by one, and they all match those specified in the documentation.