Thanks!

#1
by lightenup - opened

Superficial testing (Python and JavaScript codegen / software-engineering practices) doesn't show any performance degradation compared to https://chat.z.ai/. It's a great quantization for 96 GB of VRAM!

Thanks, great results on a Blackwell 96 GB GPU: getting an average of 80-90 t/s with a 128k context size. Finally, Sonnet at home!

Echoing the thanks. This model and quant are great. Any chance you might also do the 4.5V model that was just released?

QuantTrio org

Absolutely

QuantTrio org

We are working on it. Stay tuned!

I have been able to run this model with 128k context using vLLM on 4x RTX 3090. Thank you very much!
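For anyone curious, the setup is roughly along these lines; a quick sketch rather than my exact command, and the model id below is just a placeholder for this repo:

```python
# Rough sketch of an offline vLLM load for 4x RTX 3090 with 128k context.
# The model id is a placeholder; point it at this repo's actual name.
from vllm import LLM, SamplingParams

llm = LLM(
    model="QuantTrio/GLM-4.5-AWQ",   # placeholder repo id, not necessarily exact
    tensor_parallel_size=4,          # shard across the four 3090s
    max_model_len=131072,            # 128k context window
    gpu_memory_utilization=0.95,     # assumed value, tune for your cards
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Summarize what tensor parallelism does."], params)
print(out[0].outputs[0].text)
```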

@rainbyte what tokens/second are you getting at 100k context?

@hareram241 I just tested loading part of a codebase in an LLM client, almost 100k context, and got this output in the vLLM logs:

Avg prompt throughput: 9528.4 tokens/s, Avg generation throughput: 22.6 tokens/s

Analyzing the input files took a while, and then the response came through at about half the usual tokens/sec.

Is that enough info? Should I test in some different/better way?

Which version of vLLM should be used with this quantized model for it to run properly? I'm using vLLM 0.11, but I'm getting a KeyError: layers.1.mlp.experts.w2_weight. I've checked each of the weight files one by one, and they all match the ones listed in the documentation.
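In case it helps, this is roughly how I listed the tensor keys in the downloaded shards (a quick sketch, assuming the safetensors files sit in the current directory):

```python
# Quick sketch to list expert-related tensor keys across the downloaded shards.
# Assumes the *.safetensors shards are in the current directory.
import glob
from safetensors import safe_open

for path in sorted(glob.glob("*.safetensors")):
    with safe_open(path, framework="pt", device="cpu") as f:
        for key in f.keys():
            if "experts" in key:
                print(path, key)
```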

@lsm03624 what options are you using? It is working here with vLLM 0.11.
