How to use it with llama-server?

#1
by mancub - opened

I'm not having much luck running your quant through llama-server into my VS Code - I keep getting this back from the model:

Mistral 7B is a large language model created by Mistral AI.\n\nI'm trained to be secure, harmless and honest.\n\nMistral AI is a cutting-edge AI lab that trains models with outstanding performance on various benchmarks. I am their first published model.\n\nYou can learn more about me on my website: https://mistral.ai.\n\nNow, how can I help you?\n\n

Adding --chat-template mistral, or using your provided template via --chat-template-file chat_template.jinja, doesn't make any difference either.

Any thoughts?

I'm using llama-swap (I guess it's the same thing):

devstral-opus-q5:
  cmd: >
    /media/adam/ubuntu_d/Apps/llama.cpp/build/bin/llama-server
    -m "/media/adam/ubuntu_d/unsloth/Devstral-Small-2-24B-textonly_gguf/Devstral-Small-2-24B-Opus-Reasoning.Q5_K_M.gguf"
    --alias devstral-opus-q5
    --host 0.0.0.0 --port ${PORT}
    --ctx-size 64000
    --slot-save-path /media/adam/ubuntu_d/Apps/llama-swap/kv_cache/
    -ngl 99
    -fa on
    --parallel 1
    --batch-size 4096
    --ubatch-size 2048
    -ctk q4_0 -ctv q4_0
    --temp 0.6 --top-p 0.95 --top-k 40 --min-p 0.05 --repeat-penalty 1.15
    --defrag-thold 0.1
    --cache-reuse 256
    --chat-template-file /media/adam/ubuntu_d/unsloth/Devstral-Small-2-24B-textonly_gguf/chat_template.jinja
  proxy: http://127.0.0.1:${PORT}
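One way to debug template problems is to check what the server actually loaded. Recent llama-server builds expose a /props endpoint whose response includes the active chat template; here is a minimal sketch (the host/port are assumptions - adjust to your setup):

```python
import json
import urllib.request

def get_server_props(base_url="http://127.0.0.1:8081"):
    """Query llama-server's /props endpoint; return parsed JSON, or None if unreachable."""
    try:
        with urllib.request.urlopen(base_url + "/props", timeout=5) as resp:
            return json.load(resp)
    except OSError:
        return None

props = get_server_props()
if props is None:
    print("server not reachable - is llama-server running?")
else:
    # chat_template holds the Jinja template the server actually loaded
    print(props.get("chat_template", "")[:200])
```

If the printed template doesn't match the file you passed, the --chat-template-file flag isn't taking effect.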

This is the Jinja template; if it's not working, ask an AI to help adapt it for your use case:

{#- Default system message if no system prompt is passed. #}
{%- set default_system_message = 'Think carefully step by step inside tags before giving your answer.' %}

{#- Begin of sequence token. #}
{{- bos_token }}

{#- Handle system prompt if it exists. #}
{%- if messages[0]['role'] == 'system' %}
    {{- '[SYSTEM_PROMPT]' -}}
    {%- if messages[0]['content'] is string %}
        {{- messages[0]['content'] + '\n' + default_system_message -}}
    {%- else %}
        {%- for block in messages[0]['content'] %}
            {%- if block['type'] == 'text' %}
                {{- block['text'] }}
            {%- endif %}
        {%- endfor %}
        {{- '\n' + default_system_message -}}
    {%- endif %}
    {{- '[/SYSTEM_PROMPT]' -}}
    {%- set loop_messages = messages[1:] %}
{%- else %}
    {%- set loop_messages = messages %}
    {%- if default_system_message != '' %}
        {{- '[SYSTEM_PROMPT]' + default_system_message + '[/SYSTEM_PROMPT]' }}
    {%- endif %}
{%- endif %}

{#- Tools definition #}
{%- set tools_definition = '' %}
{%- set has_tools = false %}
{%- if tools is defined and tools is not none and tools|length > 0 %}
    {%- set has_tools = true %}
    {%- set tools_definition = '[AVAILABLE_TOOLS]' + (tools|tojson) + '[/AVAILABLE_TOOLS]' %}
    {{- tools_definition }}
{%- endif %}

{#- [MODIFIED] Validation block removed to prevent 500 errors -#}

{#- Handle conversation messages. #}
{%- for message in loop_messages %}

{#- User messages supports text content or text and image chunks. #}
{%- if message['role'] == 'user' %}
    {%- if message['content'] is string %}
        {{- '[INST]' + message['content'] + '[/INST]' }}
    {%- elif message['content'] | length > 0 %}
        {{- '[INST]' }}
        {%- if message['content'] | length == 2 %}
            {%- set blocks = message['content'] | sort(attribute='type') %}
        {%- else %}
            {%- set blocks = message['content'] %}
        {%- endif %}
        {%- for block in blocks %}
            {%- if block['type'] == 'text' %}
                {{- block['text'] }}
            {%- elif block['type'] in ['image', 'image_url'] %}
                {{- '[IMG]' }}
            {%- endif %}
        {%- endfor %}
        {{- '[/INST]' }}
    {%- endif %}

{#- Assistant messages supports text content or text and image chunks. #}
{%- elif message['role'] == 'assistant' %}
    {%- if message['content'] is string %}
        {{- message['content'] }}
    {%- elif message['content'] | length > 0 %}
        {%- for block in message['content'] %}
            {%- if block['type'] == 'text' %}
                {{- block['text'] }}
            {%- endif %}
        {%- endfor %}
    {%- endif %}

    {%- if message['tool_calls'] is defined and message['tool_calls'] is not none and message['tool_calls']|length > 0 %}
        {%- for tool in message['tool_calls'] %}
            {%- set arguments = tool['function']['arguments'] %}
            {%- if arguments is not string %}
                {%- set arguments = arguments|tojson|safe %}
            {%- elif arguments == '' %}
                {%- set arguments = '{}' %}
            {%- endif %}
            {{- '[TOOL_CALLS]' + tool['function']['name'] + '[ARGS]' + arguments }}
        {%- endfor %}
    {%- endif %}

    {#- End of sequence token for each assistant messages. #}
    {{- eos_token }}

{#- Tool messages only supports text content. #}
{%- elif message['role'] == 'tool' %}
    {{- '[TOOL_RESULTS]' + message['content']|string + '[/TOOL_RESULTS]' }}

{%- endif %}

{%- endfor %}
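For reference, this is roughly the prompt string the template above renders for a simple system + user + assistant exchange, built by hand so you can eyeball the token layout (the <s>/</s> BOS/EOS strings are assumptions for this Mistral-style model family):

```python
# Hand-built approximation of what the chat template above renders
# for a simple conversation. <s>/</s> as BOS/EOS are assumptions
# for this (Mistral-style) model family.
bos, eos = "<s>", "</s>"
default_system = "Think carefully step by step inside tags before giving your answer."

system = "You are a coding assistant."
user = "Write hello world in C."
assistant = '#include <stdio.h>\nint main(void){puts("hello world");return 0;}'

prompt = (
    bos
    + "[SYSTEM_PROMPT]" + system + "\n" + default_system + "[/SYSTEM_PROMPT]"
    + "[INST]" + user + "[/INST]"
    + assistant
    + eos
)
print(prompt)
```

Note that a system message is always emitted: either the caller's system prompt with the default appended, or the default alone.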


Thanks for posting your chat template and the startup command line!

I asked Qwen3.5-35B to fix up the template a bit; not sure if it really made it better or not πŸ˜„:

{#- Default system message if no system prompt is passed. #}
{%- set default_system_message = 'Think carefully step by step inside tags before giving your answer.' %}

{#- Begin of sequence token. #}
{{- bos_token }}

{#- Handle system prompt if it exists. #}
{%- if messages[0]['role'] == 'system' %}
    {{- '[SYSTEM_PROMPT]' }}
    {%- if messages[0]['content'] is string %}
        {{- messages[0]['content'] + '\n' + default_system_message -}}
    {%- else %}
        {%- for block in messages[0]['content'] %}
            {%- if block['type'] == 'text' %}
                {{- block['text'] }}
            {%- endif %}
        {%- endfor %}
        {{- '\n' + default_system_message -}}
    {%- endif %}
    {{- '[/SYSTEM_PROMPT]' }}
    {%- set loop_messages = messages[1:] %}
{%- else %}
    {%- set loop_messages = messages %}
    {%- if default_system_message != '' %}
        {{- '[SYSTEM_PROMPT]' + default_system_message + '[/SYSTEM_PROMPT]' }}
    {%- endif %}
{%- endif %}

{#- Tools definition #}
{%- set tools_definition = '' %}
{%- if tools is defined and tools is not none and tools|length > 0 %}
    {{- '[AVAILABLE_TOOLS]' }}
    {{- (tools|tojson)|string }}
    {{- '[/AVAILABLE_TOOLS]' }}
{%- endif %}

{#- Handle conversation messages. #}
{%- for message in loop_messages %}
    {#- User messages supports text content or text and image chunks. #}
    {%- if message['role'] == 'user' %}
        {{- '[INST]' }}
        {%- if message['content'] is string %}
            {{- message['content'] }}
        {%- elif message['content'] is iterable and message['content']|length > 0 %}
            {%- for block in message['content'] %}
                {%- if block['type'] == 'text' %}
                    {{- block['text'] }}
                {%- elif block['type'] in ['image', 'image_url', 'image_data'] %}
                    {{- '[IMG]' }}
                {%- endif %}
            {%- endfor %}
        {%- endif %}
        {{- '[/INST]' }}

    {#- Assistant messages supports text content or text and image chunks. #}
    {%- elif message['role'] == 'assistant' %}
        {%- if message['content'] is string %}
            {{- message['content'] }}
        {%- elif message['content'] is iterable and message['content']|length > 0 %}
            {%- for block in message['content'] %}
                {%- if block['type'] == 'text' %}
                    {{- block['text'] }}
                {%- endif %}
            {%- endfor %}
        {%- endif %}

        {#- Handle Tool Calls #}
        {%- if message.get('tool_calls') %}
            {%- for tool in message['tool_calls'] %}
                {%- set function_name = tool['function']['name'] %}
                {%- set arguments = tool['function']['arguments'] %}
                
                {#- Ensure arguments are a valid string for JSON #}
                {%- if arguments is not string %}
                    {%- set arguments = arguments|tojson %}
                {%- elif arguments == '' %}
                    {%- set arguments = '{}' %}
                {%- endif %}
                
                {{- '[TOOL_CALLS]' + function_name + '[ARGS]' + arguments }}
            {%- endfor %}
        {%- endif %}

        {#- End of sequence token for each assistant message. #}
        {{- eos_token }}

    {#- Tool messages only supports text content. #}
    {%- elif message['role'] == 'tool' %}
        {{- '[TOOL_RESULTS]' + message['content']|string + '[/TOOL_RESULTS]' }}
    {%- endif %}
{%- endfor %}

I'm curious, though: you're quantizing your KV cache even further despite already running a Q5 quant. Is this to work within your GPU's memory limits, or something else? And why not use your Q4 quant instead?

Another question is about the K and V tensors: in both the Q4 and Q5 versions, V is left at Q6 while K is quantized lower. I've read in many places that K is more sensitive to quantization than V and should be kept at higher precision if possible. Maybe swapping them so K is at Q6 and V is at Q4 or Q5 would yield better results (there should be no change in overall model size)?
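One thing worth separating here: the Q6 V tensors are the weight quantization inside the Q5_K_M GGUF, while the -ctk/-ctv flags quantize the runtime KV cache, which is sized by context length. For the latter, a back-of-envelope estimate helps decide whether cache quantization is even needed; the model dimensions below are assumptions for a Mistral-Small-style 24B (check the GGUF metadata for the real values):

```python
# Rough KV-cache size estimate for the -ctk/-ctv cache-type flags.
# Model dimensions are assumptions for a Mistral-Small-style 24B
# (40 layers, 8 KV heads, head_dim 128) - verify against GGUF metadata.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 40, 8, 128

# Effective bytes per element for common cache types:
# f16 = 2; q8_0 = 34 bytes per 32-element block; q4_0 = 18 per 32.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def kv_cache_bytes(ctx, k_type="f16", v_type="f16"):
    """Total K + V cache size in bytes for a given context length."""
    per_token_elems = N_LAYERS * N_KV_HEADS * HEAD_DIM
    return ctx * per_token_elems * (BYTES_PER_ELEM[k_type] + BYTES_PER_ELEM[v_type])

for k, v in [("f16", "f16"), ("q8_0", "q8_0"), ("q8_0", "q4_0"), ("q4_0", "q4_0")]:
    gib = kv_cache_bytes(64000, k, v) / 2**30
    print(f"K={k:5s} V={v:5s} @ 64k ctx: {gib:5.2f} GiB")
```

Under these assumptions, going from f16/f16 to q4_0/q4_0 at 64k context saves roughly 7 GiB, which would explain wanting q4_0 cache on a memory-constrained GPU.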

Also, why such a high temperature of 0.6 - do you need it to be more creative? (The suggested temperature is 0.15.)
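For context on why this matters: temperature divides the logits before softmax, so 0.6 leaves the output distribution noticeably flatter than 0.15. A quick stdlib illustration with made-up logits:

```python
import math

def softmax_with_temp(logits, temp):
    """Softmax over logits / temp; lower temp -> sharper, more deterministic."""
    scaled = [x / temp for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 0.5]  # made-up token logits
for t in (0.15, 0.6, 1.0):
    probs = softmax_with_temp(logits, t)
    print(f"temp={t:<4}: top-token p = {probs[0]:.3f}")
```

At 0.15 the top token dominates almost completely, while at 0.6 meaningful probability mass is left on alternatives - fine for creative writing, riskier for code generation.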

Here's how I'm testing it with ik_llama.cpp:

sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h
export CUDA_VISIBLE_DEVICES=0,1
export GGML_CUDA_GRAPH_OPT=1
./build/bin/llama-server \
  -ngl 99 \
  -t 1 \
  -c 131072 \
  -sm graph \
  -muge \
  -ger \
  -smf32 \
  --max-gpu 2 \
  --main-gpu 0 \
  --model "models/adamjen_Devstral-Small-2-24B-Opus-Reasoning/Devstral-Small-2-24B-Opus-Reasoning.Q5_K_M.gguf" \
  --jinja \
  -np 1 \
  --host 0.0.0.0 \
  --port 8081 \
  --api-key 12345 \
  --alias "devstral-small-2" \
  --temp 0.15 --top-p 0.95 --top-k 40 --min-p 0.01 \
  --flash-attn on \
  -cuda fa-offset=0 \
  --seed 3407 \
  --batch-size 4096 \
  --ubatch-size 2048 \
  --no-mmap \
  --reasoning-tokens none \
  --chat-template-kwargs '{"enable_thinking": false}' \
  --chat-template-file "models/adamjen_Devstral-Small-2-24B-Opus-Reasoning/chat_template.jinja"
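Once the server is up, I send a minimal chat completion as a smoke test. This sketch just builds the request (the port, api-key, and model alias are taken from the launch command above - adjust for your setup):

```python
import json

# Minimal /v1/chat/completions payload; model alias, port, and api key
# match the launch command above - adjust for your setup.
payload = {
    "model": "devstral-small-2",
    "messages": [{"role": "user", "content": "Reply with the single word: pong"}],
    "temperature": 0.15,
    "max_tokens": 16,
}

body = json.dumps(payload)
curl = (
    "curl -s http://127.0.0.1:8081/v1/chat/completions "
    "-H 'Authorization: Bearer 12345' "
    "-H 'Content-Type: application/json' "
    f"-d '{body}'"
)
print(curl)
```

If the reply comes back as boilerplate self-introduction like in the opening post, the template is still not being applied.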
