Thanks!
Superficial testing (Python/JavaScript codegen and software-engineering tasks) doesn't show any performance degradation compared to https://chat.z.ai/. It's a great quantization for 96 GB of VRAM!
Thanks, great results on a Blackwell 96 GB GPU, getting an average of 80-90 t/s with 128k context size. Finally, Sonnet at home!
Echoing the thanks. This model and quant are great. Any chance you might also do the 4.5V model that just released?
Absolutely
We are working on it. Stay tuned!
I have been able to run this model with 128k context using vLLM on 4x RTX 3090. Thank you very much!
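For reference, a minimal sketch of what that launch looks like with vLLM's Python API, assuming a 4-way tensor-parallel split and a 128k window (the model path is a placeholder, and exact memory/quantization flags may differ for this checkpoint):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/this-quantized-checkpoint",  # placeholder: point at the downloaded quant
    tensor_parallel_size=4,       # one shard per RTX 3090
    max_model_len=131072,         # 128k context window
    gpu_memory_utilization=0.95,  # leave a little headroom on each card
)

# Quick smoke test
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```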
@hareram241 I just tested loading part of a codebase in an LLM client, almost 100k of context, and got this output in the vLLM logs:
Avg prompt throughput: 9528.4 tokens/s, Avg generation throughput: 22.6 tokens/s
Analyzing the input files took a while, and then the response came in at about half the usual tokens/s.
Is that enough info? Should I test in a different or better way?
Which version of vLLM should be used with this quantized model in order for it to run properly? I'm using vLLM 0.11, but I'm getting a `KeyError: layers.1.mlp.experts.w2_weight`. I've checked each of the weight files one by one, and they all match those specified in the documentation.