mitkox 
posted an update 5 days ago
GLM-4.7-Flash is fast, good, and cheap.
3,074 tokens/sec peak at a 200k-token context window on my desktop PC.
Works with Claude Code and opencode for hours with no errors; a drop-in replacement for Anthropic's cloud AI.
MIT licensed, open weights, free for commercial use and modification.
Supports speculative decoding via MTP (multi-token prediction), which is highly effective at mitigating latency.
Great for on-device AI coding: the AWQ 4-bit quant is 18.5 GB. Hybrid inference runs on a single consumer GPU plus CPU RAM.
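For anyone wanting to try the drop-in setup described above, here is a minimal sketch. It assumes a vLLM server and uses an illustrative model id (`zai-org/GLM-4.7-Flash`); the exact id, quantization flags, and offload size will depend on your hardware and the published weights, so treat these values as placeholders rather than a confirmed recipe.

```shell
# 1) Serve the AWQ 4-bit weights locally. --cpu-offload-gb spills weights
#    that don't fit in VRAM into CPU RAM (the hybrid GPU + CPU setup
#    mentioned in the post). Model id and sizes are illustrative.
vllm serve zai-org/GLM-4.7-Flash \
  --quantization awq \
  --cpu-offload-gb 16 \
  --port 8000

# 2) Claude Code speaks Anthropic's Messages API; point it at a local
#    endpoint that exposes that API (directly, or via a translation
#    proxy in front of the OpenAI-compatible vLLM server):
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_AUTH_TOKEN=local-key   # placeholder; local servers typically ignore it
claude
```

`ANTHROPIC_BASE_URL` is the documented Claude Code override for the API base URL, which is what makes the "drop-in replacement" workflow possible without changing the tool itself.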

"good, fast, and cheap" are the magic words!


Usually it's a case of choose two

Fast, capable, and truly open — it hits a rare trifecta. With a 200k context window, peak throughput over 3k tokens/sec, and full MIT-licensed open weights for commercial use, it’s a serious drop-in replacement for cloud APIs like Claude.