Looking forward to trying this!

#2
by dnhkng - opened

Downloading now :)

I have a weird NVL2 GH200 system (https://dnhkng.github.io/posts/hopper/), so I have lots of RAM (960GB) for model quantisation. Happy to help with conversions, just reach out!

PS. I will try them both (and I was a big fan of ExLlama back in version 2), but if you had to pick one, this or mratsim/GLM-4.7-EXL3? I understand this model works very well with Claude Code.

Ok, now that I've died a little (reading about your system) and resurrected myself... how on Earth did you get that beast for such a price? I also live in Germany, and I would go on foot to any Bavarian forest, with the cash in one hand and a torch in the other, seeking Bernhard! 👊

It was the deal of the century; pure luck that I saw the offer in r/LocalLlama while browsing the new section (which I never normally do). I think he only sold it to me because I offered to come and collect it, and he didn't want to bother with postage and insurance, etc.

Just wrote him... maybe, for another German willing to travel and pick up! :) Wish me luck!

TabbyAPI / ExLlama v3 currently doesn't work with tool calling (https://github.com/theroyallab/tabbyAPI/pull/378#issuecomment-3679072283), which is why I've been quantizing MiniMax.

I've been trying my hand at vibe-coding / reasoning for 2 weeks now:

One thing is that on 2x RTX Pro 6000, GLM-4.7-3.84bpw runs at 27-37 tok/s (depending on sampler, context length, ...), while this MiniMax quant reaches 80-110.

The other model I want to try is MiMo-V2-Flash, as it uses attention sinks like GPT-OSS. On the current AWQ quant I reached over 130 tok/s, but it kept looping, and there is no proper "all expert calibration" file in llmcompressor, which can have a devastating impact on coding abilities: https://avtc.github.io/aquarium-side-by-side/

I finally tried MiMo-V2-Flash on OpenRouter, and it often finishes the thinking and answer before GLM-4.7 has even started. However, the main repo has several open issues on problems with Claude Code.

For ExLlama v3 models, what about using a proxy between Claude Code and Tabby? Something like this might work, and it's less work for the TabbyAPI maintainers: https://github.com/1rgs/claude-code-proxy
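For context, what such a proxy has to do is translate between the two request shapes. A minimal sketch (an illustration of the general idea, not the proxy's actual code; tool calls are omitted, which is the hard part discussed below):

```python
# Sketch: convert an Anthropic-style Messages request into an
# OpenAI-style Chat Completions request, roughly what a
# Claude Code proxy does for the text-only case.

def anthropic_to_openai(req: dict) -> dict:
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI-compatible servers expect it as the first message.
    if "system" in req:
        messages.append({"role": "system", "content": req["system"]})
    for m in req["messages"]:
        content = m["content"]
        # Anthropic content may be a list of typed blocks; flatten the text.
        if isinstance(content, list):
            content = "".join(
                b["text"] for b in content if b.get("type") == "text"
            )
        messages.append({"role": m["role"], "content": content})
    return {
        "model": req["model"],
        "messages": messages,
        "max_tokens": req.get("max_tokens", 1024),
    }
```

The text-only path is trivial; it's the tool-call messages that make this hard, as the next replies point out.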

I saw this patch for llama.cpp: https://github.com/ggml-org/llama.cpp/discussions/18005, and worked in this direction for a while, but it's probably only suitable for my Frankenstein server. On Q8, using unified RAM, I was getting slightly higher initial speeds for the first few dozen tokens (>50 tps), but then it falls back to about 10-15.

It would be interesting to identify the REAP experts and only load those into VRAM, keeping the rest in system RAM, for better coding.

I'm not even talking about Claude Code, just plain chat completions tool calls don't work.

I think you either need to make tabbyAPI support the tool-calling format of GLM models (XML), or you need to convince GLM to do tool calls in JSON format. But at the proxy level, tool calls have a specific wire format that you can't change.
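For illustration, the JSON wire format on the OpenAI-compatible side looks roughly like this (a sketch of the message shapes with hypothetical ids and tool names, not output from any particular server):

```python
# Sketch of the OpenAI-style tool-call round trip. The wire format
# fixes this shape in place: the assistant emits a tool_calls entry,
# and the follow-up "tool" message must reference its id.
assistant_msg = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",                       # hypothetical id
        "type": "function",
        "function": {
            "name": "read_file",              # hypothetical tool
            "arguments": '{"path": "main.py"}',
        },
    }],
}

tool_result_msg = {
    "role": "tool",
    "tool_call_id": "call_1",   # must match the assistant's call id
    "content": "print('hello')",
}
```

A model that emits XML-style calls in its text instead of filling `tool_calls` never produces the first message, so the proxy has nothing standard to forward.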

[Screenshot attached]

This works great with Claude Code at first glance. I will test it more thoroughly over the weekend, but it was happily editing files, grepping and examining my code, all running locally 🤩

Keep the good news coming! This would be such a relief to have someone else's confirmation that it works locally in good terms with Claude. Curious how it behaves with long context and some large code refactoring... But your behemoth will probably not break a sweat in the process.

Please report back if you get more tests done. Thanks for the effort!

It's odd, I keep getting:
API Error: 400 {"type":"error","error":{"type":"BadRequestError","message":"Message has tool role, but there was no previous assistant message with a tool call! Message has tool role, but there
was no previous assistant message with a tool call!"}}

With both Claude Code and Codex.

The writeup is here: https://dnhkng.github.io/posts/vllm-optimization-gh200/

It includes some benchmarking, and the settings I used to get Claude Code working properly.

404 - File not found... Maybe a typo?

That's weird... I got the same thing when I used Incognito mode, but it's working... GitHub CDN silliness, I guess?

Working now! Thanks in advance for the effort of documenting it! <3

It's from the chat template: https://huggingface.co/mratsim/MiniMax-M2.1-FP8-INT4-AWQ/blob/main/chat_template.jinja#L133-L136

    {%- elif message.role == 'tool' -%}
    {%- if last_tool_call.name is none -%}
        {{- raise_exception("Message has tool role, but there was no previous assistant message with a tool call!") }}
    {%- endif -%}

There is a check that tool messages are preceded by an assistant message with an actual tool call, because that's what the model is trained on. But for some reason, in your case the model receives a tool-call result without having requested a tool call.
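The template's guard can be paraphrased in plain Python (a sketch of the same rule, not the template itself): a `tool` message is only valid if the most recent assistant message actually made a tool call.

```python
def validate_tool_messages(messages: list[dict]) -> None:
    """Mirror the chat template's check: every 'tool' message must
    follow an assistant message that contains tool_calls."""
    last_had_tool_call = False
    for m in messages:
        if m["role"] == "assistant":
            last_had_tool_call = bool(m.get("tool_calls"))
        elif m["role"] == "tool":
            if not last_had_tool_call:
                raise ValueError(
                    "Message has tool role, but there was no previous "
                    "assistant message with a tool call!"
                )
```

So any client that sends a stored tool result without replaying the assistant turn that requested it (after compaction, for example) trips this exception.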

It may be that there was compaction or something.

Thanks for your answer. It's happening at the very first step with both Claude and Codex, but it works fine with opencode.

@dnhkng was able to run it properly, so I'm puzzled about what might be different.

For anyone having the same issue, I followed the claude configuration on https://dnhkng.github.io/posts/vllm-optimization-gh200/#wiring-claude-code-to-your-local-vllm and it worked fine:


export ANTHROPIC_BASE_URL="http://127.0.0.1:8000"
export ANTHROPIC_API_KEY="local-vllm"

# Force *all* Claude model aliases to your local vLLM model
export ANTHROPIC_MODEL="MiniMax-M2.1-FP8"
export ANTHROPIC_SMALL_FAST_MODEL="MiniMax-M2.1-FP8"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="MiniMax-M2.1-FP8"
export ANTHROPIC_DEFAULT_SONNET_MODEL="MiniMax-M2.1-FP8"
export ANTHROPIC_DEFAULT_OPUS_MODEL="MiniMax-M2.1-FP8"

# Optional but recommended
export CLAUDE_CODE_DISABLE_TELEMETRY=1
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export API_TIMEOUT_MS=3000000

claude "$@"

I'm wondering if it was because I was overriding only the Sonnet model.

Anyway, thanks for this model. It's amazing. I really wanted to try GLM 4.7, but I'm not sure there is a way to run it with enough context on 2x RTX 6000 Blackwell.

Damn! I feel spoiled, David! So much useful information to digest, so many things to learn from.
Thanks a lot! Much appreciated!
