Liked this model more than unsloth's!

#3
by d9k

I used zai-org_GLM-4.7-Flash-IQ2_M.gguf on my cheap 12 GB RTX 3060 video card in LM Studio:

Context length: 60000
GPU offload: 40 layers
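For reference, roughly the same load settings expressed as a llama.cpp server invocation (a sketch, not what LM Studio runs internally; the model path is wherever you downloaded the file, and `-ngl` is the layer count LM Studio calls "GPU offload"):

```shell
# Sketch: llama.cpp equivalent of the LM Studio load settings above.
# -c  = context length, -ngl = number of layers offloaded to the GPU.
llama-server \
  -m zai-org_GLM-4.7-Flash-IQ2_M.gguf \
  -c 60000 \
  -ngl 40 \
  --port 1234
```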

Result: 25+ tok/sec.

From the model card:

GLM-4.7-Flash-IQ2_M.gguf / IQ2_M / 9.85GB / Relatively low quality, uses SOTA techniques to be surprisingly usable.

My sampling settings for using the model with tools:

temperature: 0.6
Top K sampling: 50
Repeat penalty: disabled
Top P sampling: 0.98
Min P sampling: 0.03
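If you want to hit the local server directly instead of going through an extension, here is a minimal sketch of a chat request carrying these sampler settings to LM Studio's OpenAI-compatible endpoint. Assumptions: the default base URL `http://127.0.0.1:1234`, and that the server honors the llama.cpp-style `top_k`/`min_p` fields (they are not part of the OpenAI spec):

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:1234"  # LM Studio's default local server address

def build_request(prompt: str) -> dict:
    """Chat-completion payload with the sampler settings listed above."""
    return {
        "model": "zai-org_glm-4.7-flash",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.98,
        "top_k": 50,    # non-standard field; support depends on the server build
        "min_p": 0.03,  # likewise llama.cpp-style, not in the OpenAI spec
        # repeat penalty intentionally omitted (disabled above)
    }

def chat(prompt: str) -> str:
    """Send one prompt and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```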

VS Code Roo Code agent extension settings:

API provider: LM Studio
Base URL: http://127.0.0.1:1234
Model: zai-org_glm-4.7-flash

I also had to use NVIDIA PRIME render offload mode (__NV_PRIME_RENDER_OFFLOAD=1) to free up video memory by moving desktop processes to the integrated GPU. Use nvidia-smi or nvtop to double-check which processes are occupying video memory.
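To see exactly which processes are holding VRAM on the discrete card (assuming the proprietary NVIDIA driver with nvidia-smi is installed):

```shell
# List compute processes and their VRAM usage on the NVIDIA GPU,
# to confirm desktop apps really moved off to the integrated one.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```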
