HumanEval Benchmarks: GLM-4.7-Flash (UD Q4/Q5)
Thanks for sharing these unsloth quants!
I've been running some local benchmarks to validate performance and wanted to share my findings.
I noticed that the Q5 quant does not outperform the Q4 quant. I also tested gpt-oss-20b and a few other models for comparison.
I am wondering if these results align with what others are seeing, or if there might be an issue with my specific parameter setup (llama-server settings or lm_eval).
The Q4/Q5 GLM-4.7-Flash models respond fine in the web UI chat and behave well in Opencoder, using tools and interacting with the environment. However, the HumanEval results seem lower than I would expect.
Here is a summary of the runs and my reproduction protocol.
📊 Results Summary
Task: humaneval (0-shot)
Backend: llama.cpp server (via lm_eval API)
| Model | Quant | Temp | Samplers | Pass@1 | Notes |
|---|---|---|---|---|---|
| lmstudio-community/GLM-4.7-Flash-GGUF | Q4_K_M | 0.0 | Greedy | 37.20% | |
| unsloth/GLM-4.7-Flash-GGUF | UD-Q4_K_XL | 0.0 | Greedy | 37.80% | |
| unsloth/GLM-4.7-Flash-GGUF | UD-Q5_K_XL | 0.0 | Greedy | 35.37% | Unexpected: lower than Q4 |
| unsloth/GLM-4.7-Flash-GGUF | UD-Q5_K_XL | 0.0 | Greedy | 35.98% | Second run to confirm |
| unsloth/GLM-4.7-Flash-REAP-23B-A3B-GGUF | UD-Q4_K_XL | 0.0 | Greedy | 32.93% | |
| unsloth/gpt-oss-20b-GGUF | UD-Q8_K_XL | 0.0 | Greedy | 26.22% | Seems low for Q8 |
| unsloth/Nemotron-3-Nano-30B-A3B-GGUF | UD-Q4_K_XL | 0.0 | Greedy | 21.34% | |
| unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF | UD-Q4_K_XL | 0.0 | Greedy | 64.63% | |
| janhq/Jan-v3-4B-base-instruct-gguf | Q6_K | 0.0 | Greedy | 74.39% | Outlier... |
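One thing worth keeping in mind when comparing these numbers: HumanEval has only 164 problems, so a single pass@1 run carries sizable sampling noise. A quick back-of-the-envelope check (assuming each problem is an independent pass/fail trial, so the score is binomial) suggests the Q4 vs. Q5 gap is within one standard error:

```python
import math

def pass1_stderr(pass_rate: float, n_tasks: int = 164) -> float:
    """Binomial standard error of a single pass@1 estimate over n_tasks problems."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)

q4, q5 = 0.3780, 0.3537            # UD-Q4_K_XL vs UD-Q5_K_XL pass@1 from the table
se = pass1_stderr(q4)              # roughly 0.038, i.e. ~3.8 percentage points
print(f"stderr per run: {se:.3f}, Q4-Q5 gap: {q4 - q5:.3f}")
```

By that estimate the per-run noise (~3.8 points) is larger than the ~2.4-point Q4/Q5 gap, so the two quants may be statistically indistinguishable on this benchmark with a single greedy run each.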
🖥️ Hardware Setup
- GPU: NVIDIA GTX 1070 (8GB VRAM)
- CPU: AMD Ryzen 5800X3D
- RAM: 32GB DDR4 3600
- Environment: Docker (lm_eval) + local llama-server
🧪 Reproduction Protocol
1. llama.cpp Server Arguments
Used the following flags for the server:
Greedy Run (Temp 0.0):
Changes: Temp 0.0, Top-P 1.0, Min-P 0.0.
```bash
llama-server -m [MODEL_PATH] \
  --host 0.0.0.0 --port 1234 -c 16384 --threads 8 --parallel 2 \
  -fa on --jinja --repeat-penalty 1.0 \
  --temp 0.0 --top-p 1.0 --min-p 0.0 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --n-cpu-moe 33 -b 1024 -ub 256
```
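To rule out sampler misconfiguration, it can help to hit the server by hand with the same kind of request lm_eval sends. A minimal sketch, assuming the flags above (port 1234) and the OpenAI-style `/v1/completions` schema that llama-server exposes; the prompt is an arbitrary placeholder:

```python
import json
import urllib.request

def completion_request(prompt: str,
                       base_url: str = "http://127.0.0.1:1234/v1/completions"):
    """Build a greedy /v1/completions request matching the server flags above."""
    payload = {
        "model": "zai-org/GLM-4.7-Flash",  # informational for llama-server
        "prompt": prompt,
        "max_tokens": 128,
        "temperature": 0.0,                # greedy: must match --temp 0.0
        "top_p": 1.0,
    }
    req = urllib.request.Request(
        base_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return payload, req

payload, req = completion_request("def add(a, b):")
# with the server running: urllib.request.urlopen(req).read()
```

Sending the same prompt twice should return byte-identical completions at temperature 0.0; if it doesn't, the sampler settings aren't actually reaching the decoder.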
2. Evaluation Container
I used a dedicated Docker container to run lm_eval against the local server to ensure a clean environment.
Dockerfile

```dockerfile
FROM python:3.12-slim
WORKDIR /app
RUN pip install --no-cache-dir "lm_eval[api]" transformers
CMD ["lm_eval", "--help"]
```
docker-compose.yml

```yaml
services:
  lm_eval:
    build: .
    container_name: lm_eval_runner
    network_mode: host
    volumes:
      - ./eval_results:/app/eval_results
    environment:
      - HF_ALLOW_CODE_EVAL=1
    command: >
      lm_eval --model local-completions
      --model_args base_url=http://127.0.0.1:1234/v1/completions,num_concurrent=2,model=zai-org/GLM-4.7-Flash
      --tasks humaneval
      --confirm_run_unsafe_code
      --log_samples
      --output_path /app/eval_results
    # --model_args base_url=http://127.0.0.1:1234/v1/completions,num_concurrent=2,model=openai/gpt-oss-20b
```
Execution Script

```bash
#!/bin/bash
docker compose down
docker compose up --build -d --remove-orphans
docker logs -f lm_eval_runner
```
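To pull the scores out afterwards: lm_eval drops timestamped `results_*.json` files under `--output_path`. A small helper sketch; the exact metric key has changed between lm_eval versions (e.g. `pass@1,create_test` in recent releases), so this matches on the `pass@1` prefix rather than hard-coding it:

```python
import json
from pathlib import Path

def extract_pass1(results: dict, task: str = "humaneval") -> float:
    """Pull the pass@1 score out of a parsed lm_eval results dict."""
    metrics = results["results"][task]
    for key, value in metrics.items():
        # keys look like "pass@1,create_test"; skip "pass@1_stderr,..." etc.
        if key.split(",")[0] == "pass@1":
            return float(value)
    raise KeyError(f"no pass@1 metric found for task {task!r}")

def latest_results_file(results_dir: str = "./eval_results") -> Path:
    """lm_eval writes timestamped results_*.json files; pick the newest."""
    files = sorted(Path(results_dir).rglob("results_*.json"))
    if not files:
        raise FileNotFoundError(f"no results_*.json under {results_dir}")
    return files[-1]

# usage after a run:
# results = json.loads(latest_results_file().read_text())
# print(f"pass@1 = {extract_pass1(results):.2%}")
```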
Has anyone else benchmarked the GLM-4.7-Flash GGUF models? I'm curious whether this performance is expected at these quantization levels or whether it's due to a misconfiguration on my end.