HumanEval Benchmarks: GLM-4.7-Flash (UD Q4/Q5)

#22
by k5uu5 - opened

Thanks for sharing these unsloth quants!

I've been running some local benchmarks to validate performance and wanted to share my findings.

I noticed that the Q5 quant does not outperform the Q4 quant. I tested gpt-oss-20b and a few other models for comparison.

I am wondering if these results align with what others are seeing, or if there might be an issue with my specific parameter setup (llama-server settings or lm_eval).

The Q4/Q5 GLM-4.7-Flash models respond fine in the web UI chat and behave well in Opencoder, using tools and interacting with the environment. However, the HumanEval results seem low?

Here is a summary of the runs and my reproduction protocol.

πŸ“Š Results Summary

Task: humaneval (0-shot)
Backend: llama.cpp server (via lm_eval API)

| Model | Quant | Temp | Sampler | Pass@1 | Notes |
|-------|-------|------|---------|--------|-------|
| lmstudio-community/GLM-4.7-Flash-GGUF | Q4_K_M | 0.0 | Greedy | 37.20% | |
| unsloth/GLM-4.7-Flash-GGUF | UD-Q4_K_XL | 0.0 | Greedy | 37.80% | |
| unsloth/GLM-4.7-Flash-GGUF | UD-Q5_K_XL | 0.0 | Greedy | 35.37% | Unexpected: lower than / same as Q4 |
| unsloth/GLM-4.7-Flash-GGUF | UD-Q5_K_XL | 0.0 | Greedy | 35.98% | Second run to confirm |
| unsloth/GLM-4.7-Flash-REAP-23B-A3B-GGUF | UD-Q4_K_XL | 0.0 | Greedy | 32.93% | |
| unsloth/gpt-oss-20b-GGUF | UD-Q8_K_XL | 0.0 | Greedy | 26.22% | Seems low for Q8 |
| unsloth/Nemotron-3-Nano-30B-A3B-GGUF | UD-Q4_K_XL | 0.0 | Greedy | 21.34% | |
| unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF | UD-Q4_K_XL | 0.0 | Greedy | 64.63% | |
| janhq/Jan-v3-4B-base-instruct-gguf | Q6_K | 0.0 | Greedy | 74.39% | Outlier... |
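One thing worth keeping in mind when reading the Q4 vs. Q5 gap: HumanEval has only 164 problems, and with a single greedy sample per problem, pass@1 is just the fraction solved. A back-of-the-envelope check (treating each problem as an independent Bernoulli trial, which is an assumption) suggests the ~2.4-point Q4/Q5 difference is well within sampling noise:

```python
import math

N = 164  # number of HumanEval problems

def stderr(p, n=N):
    """Binomial standard error of a pass rate estimated from n problems."""
    return math.sqrt(p * (1 - p) / n)

# UD-Q4_K_XL vs UD-Q5_K_XL pass@1 from the table above
q4, q5 = 0.3780, 0.3537
se_diff = math.sqrt(stderr(q4) ** 2 + stderr(q5) ** 2)  # stderr of the difference
z = (q4 - q5) / se_diff
print(f"stderr(Q4) ~ {stderr(q4):.3f}, diff = {q4 - q5:.3f}, z ~ {z:.2f}")
```

With z well below 1, the gap corresponds to about 4 problems out of 164, so the Q5 result being "lower" than Q4 may not be statistically meaningful on this benchmark alone.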

πŸ–₯️ Hardware Setup

  • GPU: NVIDIA GTX 1070 (8GB VRAM)
  • CPU: AMD Ryzen 5800X3D
  • RAM: 32GB DDR4 3600
  • Environment: Docker (lm_eval) + Local llama-server

πŸ§ͺ Reproduction Protocol

1. llama.cpp Server Arguments

Server flags used for the greedy run (temp 0.0, top-p 1.0, min-p 0.0):

llama-server -m [MODEL_PATH] \
  --host 0.0.0.0 --port 1234 -c 16384 --threads 8 --parallel 2 \
  -fa on --jinja --repeat-penalty 1.0 \
  --temp 0.0 --top-p 1.0 --min-p 0.0 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --n-cpu-moe 33 -b 1024 -ub 256
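As a sanity check that the server really is decoding greedily, it can help to hit the same OpenAI-compatible /v1/completions endpoint that lm_eval's local-completions backend uses and confirm two identical requests give identical output. A minimal sketch (the helper names are mine; the sampler fields mirror the server flags above):

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:1234/v1/completions"

def greedy_payload(prompt, max_tokens=256):
    # Mirror the llama-server flags: temp 0.0, top-p 1.0 (greedy decoding).
    return {
        "model": "zai-org/GLM-4.7-Flash",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
        "top_p": 1.0,
    }

def complete(prompt, base_url=BASE_URL):
    """POST one completion request and return the generated text."""
    req = urllib.request.Request(
        base_url,
        data=json.dumps(greedy_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# With temp 0.0, two identical requests should return identical text:
#   assert complete("def add(a, b):") == complete("def add(a, b):")
```

If the two responses differ, some sampler setting is not being applied as expected, which would also affect the benchmark runs.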

2. Evaluation Container

I used a dedicated Docker container to run lm_eval against the local server to ensure a clean environment.

Dockerfile

FROM python:3.12-slim
WORKDIR /app
RUN pip install --no-cache-dir "lm_eval[api]" transformers
CMD ["lm_eval", "--help"]

docker-compose.yml

services:
  lm_eval:
    build: .
    container_name: lm_eval_runner
    network_mode: host
    volumes:
      - ./eval_results:/app/eval_results
    environment:
      - HF_ALLOW_CODE_EVAL=1
    command: >
      lm_eval --model local-completions
      --model_args base_url=http://127.0.0.1:1234/v1/completions,num_concurrent=2,model=zai-org/GLM-4.7-Flash
      --tasks humaneval
      --confirm_run_unsafe_code
      --log_samples
      --output_path /app/eval_results
#     --model_args base_url=http://127.0.0.1:1234/v1/completions,num_concurrent=2,model=openai/gpt-oss-20b     

Execution Script

#!/bin/bash
docker compose down
docker compose up --build -d --remove-orphans
docker logs -f lm_eval_runner
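To pull the headline number out of the run afterwards, I read the newest results file from the mounted eval_results directory. A small sketch, assuming lm_eval's results_*.json layout of `{"results": {task: {metric: value}}}` and a metric key starting with "pass@1" (both assumptions based on lm_eval's JSON output):

```python
import glob
import json

def latest_pass_at_1(results_dir="./eval_results", task="humaneval"):
    """Return the pass@1 metrics for `task` from the newest results file.

    Assumes lm_eval wrote results_*.json files under results_dir and that
    pass@1 metric keys start with the literal prefix "pass@1".
    """
    files = sorted(glob.glob(f"{results_dir}/**/results_*.json", recursive=True))
    with open(files[-1]) as f:
        metrics = json.load(f)["results"][task]
    return {k: v for k, v in metrics.items() if k.startswith("pass@1")}
```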

Has anyone else benchmarked the GLM-4.7-Flash GGUF model? I'm curious whether this performance is expected at these quantization levels or whether it's due to a misconfiguration on my end.
