Thank you!
Great quant, as always, thank you!
The iq4_xs has no imatrix, just for you! ;p
ik updated and it's working right now off the tip of main as per: https://github.com/ikawrakow/ik_llama.cpp/pull/1240
So I'll cook an imatrix and release some of those too haha
Thanks, this quant fits perfectly on 4x R9700, getting about PP 2000 tokens/sec and TG 35 tokens/sec.
I just updated the perplexity graphs, I got lucky with that iq4_xs as it holds up well even compared to ik's newer quantization types!
It also seems to be better than the "official Int4" version that just got updated: https://huggingface.co/stepfun-ai/Step-3.5-Flash-Int4/discussions/13
I did a quick test and it seems able to vibe code some small C++ okay at least, though folks have been saying it is quite chatty when thinking.
It seems decent at coding, but I can't get tool calling to work with llama.cpp. Getting this error, probably because of limited templating support in llama.cpp: operator(): got exception: {"error":{"code":500,"message":"\n------------\nWhile executing FilterExpression at line 55, column 63 in source:\n...- for args_name, args_value in arguments|items %}↵ {{- '<...\n ^\nError: Unknown (built-in) filter 'items' for type String","type":"server_error"}}
I've been able to use native tool calls with OpenCode using this modified version.
It's been working great for hundreds of calls, with the only drawback being artifacts of the tool call showing up in the TUI.
e.g.
First, let me check what's in the current directory and explore the project structure.
I'll analyze this codebase to create an AGENTS.md file with build/lint/test commands and code style guidelines.<tool_call>
<function=bash
{% macro render_content(content) %}
{%- if content is none -%}
{{- '' }}
{%- elif content is string -%}
{{- content }}
{%- elif content is mapping -%}
{{- content['value'] if 'value' in content else content['text'] }}
{%- elif content is iterable -%}
{%- for item in content -%}
{%- if item is string -%}
{{- item }}
{%- elif item.type == 'text' -%}
{{- item['value'] if 'value' in item else item['text'] }}
{%- elif item.type == 'image' -%}
<im_patch>
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{% endmacro %}
{{- bos_token }}
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- render_content(messages[0].content) + '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou have access to the following functions in JSONSchema format:\n\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson(ensure_ascii=False) }}
{%- endfor %}
{{- "\n</tools>\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...>\n...\n</function> block must be nested within <tool_call>\n...\n</tool_call> XML tags\n- Required parameters MUST be specified\n</IMPORTANT><|im_end|>\n" }}
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' + render_content(messages[0].content) + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- set ns = namespace(last_query_index=0) %}
{%- for message in messages %}
{%- if message.role == "user" %}
{%- set content_str = render_content(message.content) %}
{%- if content_str is string and not(content_str.startswith('<tool_response>') and content_str.endswith('</tool_response>')) %}
{%- set ns.last_query_index = loop.index0 %}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- for message in messages %}
{%- set content = render_content(message.content) %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
{%- set role_name = 'observation' if (message.role == "system" and not loop.first and message.name == 'observation') else message.role %}
{{- '<|im_start|>' + role_name + '\n' + content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{%- if message.reasoning_content is string %}
{%- set reasoning_content = render_content(message.reasoning_content) %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
{%- set content = content.split('</think>')[-1].lstrip('\n') %}
{%- else %}
{%- set reasoning_content = '' %}
{%- endif %}
{%- endif %}
{%- if loop.index0 > ns.last_query_index %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content + '\n</think>\n' + content }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if tool_call.function is defined %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '<tool_call>\n<function=' + tool_call.name + '>\n' }}
{%- if tool_call.arguments is defined %}
{%- set arguments = tool_call.arguments %}
{# FIX: Removed fromjson, use .items(), added mapping check #}
{%- if arguments is mapping %}
{%- for args_name, args_value in arguments.items() %}
{{- '<parameter=' + args_name + '>\n' }}
{%- set args_value = args_value | tojson(ensure_ascii=False) | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
{{- args_value }}
{{- '\n</parameter>\n' }}
{%- endfor %}
{%- endif %}
{%- endif %}
{{- '</function>\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>tool_response\n' }}
{%- endif %}
{{- '<tool_response>' }}
{{- content }}
{{- '</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n<think>\n' }}
{%- endif %}
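If you want to sanity-check the template before wiring it into a client, here's a rough sketch using transformers' apply_chat_template. The model id, the saved file name, and the fake bash tool are assumptions for illustration; any tokenizer that accepts a custom chat template should do (you may need trust_remote_code=True).

```python
# Rough sanity check (not an official workflow): render the modified template
# against a fake tool-call turn and eyeball the output.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("stepfun-ai/Step-3.5-Flash")  # assumption: swap in your model
template = open("chat_template_modified.jinja").read()            # assumption: the template above saved to a file

tools = [{
    "type": "function",
    "function": {
        "name": "bash",
        "description": "Run a shell command",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

messages = [
    {"role": "user", "content": "List the files in the repo"},
    {"role": "assistant", "content": "", "tool_calls": [{
        "type": "function",
        "function": {"name": "bash", "arguments": {"command": "ls -la"}},
    }]},
    {"role": "tool", "content": "README.md  src/"},
]

out = tok.apply_chat_template(
    messages, tools=tools, chat_template=template,
    add_generation_prompt=True, tokenize=False,
)
print(out)  # expect a <tool_call>/<function=bash> block and a <tool_response> block
```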
Yeah I noticed it was giving similar issues on mainline for me too, and the recommendation I've heard from pwilkin is to try this branch: https://github.com/ggml-org/llama.cpp/pull/18675
Thanks for sharing yours, it's been tricky with all the new models to get the chat templates just right so they work with the web interface for simple chats and also with all the various clients. I'm doing testing with pydantic-ai mostly.
I've had good luck in a couple simple tests with the default chat template on ik_llama.cpp, but if I hit any snags I'll try your jinja template out!!
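FWIW my pydantic-ai smoke test is roughly this; import paths and result fields drift between pydantic-ai versions, and the port/model name are placeholders, so treat it as a sketch pointed at a local llama.cpp/ik_llama.cpp server.

```python
# Rough smoke test for native tool calling against a local OpenAI-compatible
# llama-server (port and model name are placeholders).
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

model = OpenAIModel(
    "Step-3.5-Flash",  # placeholder; the local server typically ignores this
    provider=OpenAIProvider(base_url="http://localhost:8080/v1", api_key="none"),
)
agent = Agent(model)

@agent.tool_plain
def list_files(path: str) -> str:
    """List the files in a directory."""
    import os
    return "\n".join(os.listdir(path))

result = agent.run_sync("What files are in the current directory?")
print(result.output)  # .data on older pydantic-ai versions
```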
Super thanks for testing y'all, I linked to this discussion in the model card! Cheers!
I saw your post over here: https://github.com/ggml-org/llama.cpp/pull/19283#issuecomment-3868260203
Have you tried the chat template above? Or have you tried running mainline llama.cpp with pwilkin's autoparser branch: https://github.com/ggml-org/llama.cpp/pull/18675
iirc you have a bunch of MI50 GPUs and did a post about your rig? A guy was asking on reddit about MI50s and how they are supported on ik_llama.cpp, I figured you're the expert haha: https://www.reddit.com/r/LocalLLaMA/comments/1r91akx/comment/o6a5yph/
Hi @kerrmetric ! I played with four MI50s for a few months, here is what I found...
The fastest quants for these cards are Q4_0 or Q4_1, because they are less taxing on compute, which is what these cards really lack. When you see high-performing MI50 benchmarks shared on reddit, it's always those quants. If you deviate from them, expect a 2x inference speed drop.
Due to the lack of compute power, expect a significant performance drop with larger context sizes. You may see good tokens per second with a small context only to watch it plummet as the context grows.
- with llama.cpp, the best use I got for these cards is offloading MoE while keeping attention on an RTX 5090, roughly comparable in speed to MoE offloading on an 8-channel DDR5 system (see the sketch after this list)
- I got good speed from these cards with tensor-parallel setup on vllm, but vllm is very hands on with manual patching and there are very few models that work with MI50s (try the gfx906 branch)
- ik_llama doesn't support ROCm, though it may have some Vulkan support, never tried it with MI50s
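To make the MoE-offload point concrete, a minimal sketch of the kind of invocation I mean; the device/buffer name (Vulkan1) and the tensor regex are placeholders that depend on your build and GPU order, so check them against your own setup.

```bash
# Minimal sketch, not a tuned command: keep everything on GPU (-ngl 99) but
# route the MoE expert tensors onto the slower card with --override-tensor.
./llama-server \
  -m model-Q4_0.gguf \
  -ngl 99 \
  --override-tensor "blk\..*\.ffn_.*_exps.*=Vulkan1" \
  -c 16384
```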
I ended up upgrading to four R9700 instead, and they work much better overall with good vllm support and fewer issues.
Appreciate the response! I wonder if recent work to improve ROCm performance has paid off (I also do a hacky merge of a gfx906 fork with head https://github.com/iacopPBK/llama.cpp-gfx906). I'm getting roughly the same performance with Q4_0 and Q4_K_M at this point on dense models like Gemma 3 27B. And I'm getting a frankly inexplicable 80 tokens per second with long context lengths on gpt-oss-120B (it's faster than Qwen3 30B-A3B for me). I've never gotten the vllm gfx906 branch to work, sadly.
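The "hacky merge" is nothing fancy, roughly this; the fork's branch name is an assumption and you should expect to resolve conflicts by hand.

```bash
# Rough idea only: graft the gfx906 fork onto current mainline llama.cpp.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git remote add gfx906 https://github.com/iacopPBK/llama.cpp-gfx906
git fetch gfx906
git merge gfx906/master   # assumption: whatever the fork's default branch is
```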
I saw your post over here: https://github.com/ggml-org/llama.cpp/pull/19283#issuecomment-3868260203
Have you tried the chat template above? Or have you tried running mainline llama.cpp with pwilkin's autoparser branch: https://github.com/ggml-org/llama.cpp/pull/18675
Thanks for asking :)
I had issues with the proposal above in the PR, and issues with the autoparser branch at that time.
EDIT: oh you mean the one here that I missed, let me check...
Now I'm trying it again with autoparser:
- Qwen3-Coder-Next + opencode (1.2.6): works nicely
- Step-3.5-Flash:
  - with --reasoning-budget 0: outputs thinking tokens as answer, so broken even without testing tool calling
  - without: too much reasoning and with my 20 tok/sec I give up on this
Sounds like Step-3.5-Flash requires thinking enabled to work with agentic coding clients then? I wonder if there is a way to leave reasoning on so it can do tool calls, but get it to think "less", possibly using a partially pre-filled thinking response or something.
I also got some new quants up for https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF but it might be too big or slower than Step-3.5-Flash for your rig.
Oh, the above template works with Step-3.5 + --reasoning-budget 0 in agentic mode (opencode) \o/
Using your IQ4_XS quant.