Liked this model more than unsloth's!
#3 · opened by d9k
Used zai-org_GLM-4.7-Flash-IQ2_M.gguf on my cheap 12 GB RTX 3060 video card in LM Studio:
Context length: 60000
GPU offload: 40
Result: 25+ tok/sec.
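For comparison, if you prefer running llama.cpp directly instead of LM Studio, a roughly equivalent launch would look like this (a sketch assuming a recent llama-server build; flag spellings can differ between versions):

```shell
# -c sets the context length and -ngl the number of GPU-offloaded layers,
# matching the "Context length: 60000" / "GPU offload: 40" settings above.
llama-server -m zai-org_GLM-4.7-Flash-IQ2_M.gguf -c 60000 -ngl 40 --port 1234
```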
From model card:
GLM-4.7-Flash-IQ2_M.gguf / IQ2_M / 9.85GB / Relatively low quality, uses SOTA techniques to be surprisingly usable.
My settings for using the model with tools:
temperature: 0.6
Top K sampling: 50
Repeat penalty: disabled
Top P sampling: 0.98
Min P sampling: 0.03
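The same sampler settings can be sent straight to LM Studio's OpenAI-compatible server, which is what Roo Code does under the hood. A sketch (top_k and min_p are llama.cpp-side extensions; whether they are honored over this endpoint depends on your LM Studio version):

```shell
# Mirrors the sampler settings above; disabling repeat penalty just
# means omitting repeat_penalty from the request body.
curl http://127.0.0.1:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org_glm-4.7-flash",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.6,
    "top_k": 50,
    "top_p": 0.98,
    "min_p": 0.03
  }'
```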
VS Code Roo Code agent extension settings:
API provider: LM Studio
Base URL: http://127.0.0.1:1234
Model: zai-org_glm-4.7-flash
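Before pointing Roo Code at the server, you can sanity-check the Base URL and the exact model id by asking the server what it exposes (assumes LM Studio's local server is running on the default port):

```shell
# The model id entered in Roo Code must appear in this list.
curl http://127.0.0.1:1234/v1/models
```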
Also had to use NVIDIA PRIME offload mode with __NV_PRIME_RENDER_OFFLOAD=1 to free video memory by moving desktop processes to the integrated GPU (use nvidia-smi or nvtop to double-check the video memory occupied by each process).
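A quick way to see what is eating VRAM before and after switching PRIME modes:

```shell
# Overall VRAM usage per GPU
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
# Full view including the per-process list
nvidia-smi
```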