good quant
It did well in my tests, really a good model. If possible, kindly make a quant for https://huggingface.co/stepfun-ai/Step-3.5-Flash; people will really like it, imo.
@gopi87 Thanks. This one was pretty hard; it took me a day of iterating on the layers to finally get it tuned right, a very finicky model about its quants. Agree, it does seem quite good. I'll take a look at the flash model too. I haven't done it for myself because it's going to be around 5 tps on my rig, right on the threshold of usable, and due to the slow tps it will take a very long time to optimize. It will also need 128 GB of CPU RAM, which most don't have. I think this 80B with 3B active is close to the perfect layout: I get 20 tps on fairly old consumer hardware.
@gopi87 I did a Step-3.5-Flash quant today, but performance is not acceptable to upload. It went 0-3 on the first part of my eval prompts, and I stopped testing there since at that point it's a beyond-repair dumpster fire. It was a strong Q4_K_H sized at 117.6 GB with a minimum quant of Q4_K_S across layers, and it ran at about 7.5 tps on a 9900K / 4070. I am not going to upload models to my page that will not even remotely make it through my acceptance qual tests (sorry).
Thanks for the effort, mate, and try the no-thinking version too.
CUDA_VISIBLE_DEVICES="0" ./bin/llama-server \
  --model "/home/gopi/deepresearch-ui/model/stepfun-ai_Step-3.5-Flash-IQ4_XS-00001-of-00003.gguf" \
  --n-cpu-moe 49 \
  -ngl 99 \
  --ctx-size 40000 \
  --threads 28 \
  --threads-batch 28 \
  --reasoning-budget 0 \
  --host 0.0.0.0 \
  --jinja \
  --port 8080 \
  --temp 0.6 \
  --top-p 0.95
I was testing this version and getting around 12 t/s.
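For reference, once that server command is up, a quick way to sanity-check it is to POST to llama-server's OpenAI-compatible /v1/chat/completions endpoint on the --port given (8080). A minimal sketch below just builds the request body; the sampling fields mirror the flags above (--temp 0.6, --top-p 0.95), while the model name and prompt are placeholders.

```python
import json

# Request body for llama-server's OpenAI-compatible chat endpoint.
# temperature / top_p mirror the --temp 0.6 / --top-p 0.95 flags;
# the "model" field is largely cosmetic for a single-model llama-server.
payload = {
    "model": "Step-3.5-Flash",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 64,
}
print(json.dumps(payload))
# POST this body to http://localhost:8080/v1/chat/completions,
# e.g. curl -H 'Content-Type: application/json' -d @payload.json
```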
Tried the no-think template, no joy. The model cannot handle any kind of ambiguity in prompts and just talks to itself until it runs out of tokens on almost all prompts; it's completely unusable. I'll keep it around on my disk for a couple of weeks in case somebody over in llama.cpp land finds and fixes a bug for it. There is a chance it's overly sensitive to being quantized. I found the Qwen3 code next 80B MoE to be very sensitive to quantization degradation, and maybe this one is even more sensitive at the bigger 196B MoE scale.