evilfreelancer/GLM-4.7-Flash-GGUF

Converted and quantized version of zai-org/GLM-4.7-Flash, with the help of this fix.

An MXFP4_MOE quant is also available.

If you experience looping or repetition when using GLM 4.7 Flash, try adding --temp 1.0 --min-p 0.01 --top-p 0.95 --dry-multiplier 1.1; the DRY sampler setting --dry-multiplier 1.1 in particular helps.

If you still experience issues, increase --dry-multiplier from 1.1 to, say, 1.5.

Use --kv-unified to speed up inference.
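
For a quick local test outside Docker, the same flags can be passed to llama-cli directly. A minimal sketch; the quant tag matches the MXFP4_MOE file in this repo, and the prompt is arbitrary:

```bash
# Minimal local run with the recommended anti-repetition sampling
# settings and a unified KV cache. -hf downloads the MXFP4_MOE quant
# from this repo on first use.
llama-cli -hf evilfreelancer/GLM-4.7-Flash-GGUF:MXFP4_MOE \
  -ngl 99 --kv-unified \
  --temp 1.0 --min-p 0.01 --top-p 0.95 --dry-multiplier 1.1 \
  -p "Hello"
```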

It is also recommended to use at least 4-bit precision for best results.

Adjust the context window as required, up to 202752 tokens.
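
Below is an example Docker Compose setup serving the MXFP4_MOE quant with the llama.cpp CUDA server image across four GPUs: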

```yaml
x-shared-logs: &shared-logs
  logging:
    driver: "json-file"
    options:
      max-size: "100k"

services:
  glm47-flash-30b:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    restart: unless-stopped
    volumes:
      - ./llama-cpp_data:/root/.cache
    ports:
      - "8080:8080"
    command: --host 0.0.0.0 --port 8080 -hf evilfreelancer/GLM-4.7-Flash-GGUF:MXFP4_MOE -fa 1 -ngl 99 -ub 4092 -b 4092 -c 202752 --jinja -np 10 -t 48 --threads-batch 96 --temp 1.0 --min-p 0.01 --top-p 0.95 --dry-multiplier 1.1
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: [ '0', '1', '2', '3' ]
            capabilities: [ gpu ]
    <<: *shared-logs
```
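
Once the container is up, llama-server exposes an OpenAI-compatible API on the mapped port 8080; a quick smoke test with curl (the prompt is arbitrary):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```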