HumanEval Benchmarks: GLM-4.7-Flash (UD Q4/Q5)
Thanks for sharing these unsloth quants!
I've been running some local benchmarks to validate performance and wanted to share my findings.
I noticed that the Q5 quant does not outperform the Q4 quant. I also tested gpt-oss-20b and a few other models for comparison.
I am wondering if these results align with what others are seeing, or if there might be an issue with my specific parameter setup (llama-server settings or lm_eval).
The Q4/Q5 GLM-4.7-Flash models respond fine in the web UI chat and behave well in Opencoder, using tools and interacting with the environment. However, the HumanEval results seem lower than I would expect.
Here is a summary of the runs and my reproduction protocol.
📊 Results Summary
Task: humaneval (0-shot)
Backend: llama.cpp server (via lm_eval API)
| Model | Quant | Temp | Samplers | Pass@1 | Notes |
|---|---|---|---|---|---|
| lmstudio-community/GLM-4.7-Flash-GGUF | Q4_K_M | 0.0 | Greedy | 37.20% | |
| unsloth/GLM-4.7-Flash-GGUF | UD-Q4_K_XL | 0.0 | Greedy | 37.80% | |
| unsloth/GLM-4.7-Flash-GGUF | UD-Q5_K_XL | 0.0 | Greedy | 35.37% | Unexpected: lower than Q4 |
| unsloth/GLM-4.7-Flash-GGUF | UD-Q5_K_XL | 0.0 | Greedy | 35.98% | Second run to confirm |
| unsloth/GLM-4.7-Flash-REAP-23B-A3B-GGUF | UD-Q4_K_XL | 0.0 | Greedy | 32.93% | |
| unsloth/gpt-oss-20b-GGUF | UD-Q8_K_XL | 0.0 | Greedy | 26.22% | Seems low for Q8 |
| unsloth/Nemotron-3-Nano-30B-A3B-GGUF | UD-Q4_K_XL | 0.0 | Greedy | 21.34% | |
| unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF | UD-Q4_K_XL | 0.0 | Greedy | 64.63% | |
| janhq/Jan-v3-4B-base-instruct-gguf | Q6_K | 0.0 | Greedy | 74.39% | Outlier... |
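One thing worth keeping in mind when comparing these numbers: HumanEval has only 164 problems, so a single pass@1 run carries sizable sampling noise. A quick back-of-the-envelope check (assuming each problem is an independent pass/fail trial, so the score is binomial) suggests the Q4 vs. Q5 gap is within one standard error:

```python
import math

def pass1_stderr(pass_rate: float, n_tasks: int = 164) -> float:
    """Binomial standard error of a single pass@1 estimate over n_tasks problems."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)

q4, q5 = 0.3780, 0.3537            # UD-Q4_K_XL vs UD-Q5_K_XL pass@1 from the table
se = pass1_stderr(q4)              # roughly 0.038, i.e. ~3.8 percentage points
print(f"stderr per run: {se:.3f}, Q4-Q5 gap: {q4 - q5:.3f}")
```

By that estimate the per-run noise (~3.8 points) is larger than the ~2.4-point Q4/Q5 gap, so the two quants may be statistically indistinguishable on this benchmark with a single greedy run each.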
🖥️ Hardware Setup
- GPU: NVIDIA GTX 1070 (8GB VRAM)
- CPU: AMD Ryzen 5800X3D
- RAM: 32GB DDR4 3600
- Environment: Docker (lm_eval) + local llama-server
🧪 Reproduction Protocol
1. llama.cpp Server Arguments
Used the following flags for the server:
Greedy Run (Temp 0.0):
Changes: Temp 0.0, Top-P 1.0, Min-P 0.0.
```bash
llama-server -m [MODEL_PATH] \
  --host 0.0.0.0 --port 1234 -c 16384 --threads 8 --parallel 2 \
  -fa on --jinja --repeat-penalty 1.0 \
  --temp 0.0 --top-p 1.0 --min-p 0.0 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --n-cpu-moe 33 -b 1024 -ub 256
```
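To rule out sampler misconfiguration, it can help to hit the server by hand with the same kind of request lm_eval sends. A minimal sketch, assuming the flags above (port 1234) and the OpenAI-style `/v1/completions` schema that llama-server exposes; the prompt is an arbitrary placeholder:

```python
import json
import urllib.request

def completion_request(prompt: str,
                       base_url: str = "http://127.0.0.1:1234/v1/completions"):
    """Build a greedy /v1/completions request matching the server flags above."""
    payload = {
        "model": "zai-org/GLM-4.7-Flash",  # informational for llama-server
        "prompt": prompt,
        "max_tokens": 128,
        "temperature": 0.0,                # greedy: must match --temp 0.0
        "top_p": 1.0,
    }
    req = urllib.request.Request(
        base_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return payload, req

payload, req = completion_request("def add(a, b):")
# with the server running: urllib.request.urlopen(req).read()
```

Sending the same prompt twice should return byte-identical completions at temperature 0.0; if it doesn't, the sampler settings aren't actually reaching the decoder.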
2. Evaluation Container
I used a dedicated Docker container to run lm_eval against the local server to ensure a clean environment.
Dockerfile

```dockerfile
FROM python:3.12-slim
WORKDIR /app
RUN pip install --no-cache-dir "lm_eval[api]" transformers
CMD ["lm_eval", "--help"]
```
docker-compose.yml

```yaml
services:
  lm_eval:
    build: .
    container_name: lm_eval_runner
    network_mode: host
    volumes:
      - ./eval_results:/app/eval_results
    environment:
      - HF_ALLOW_CODE_EVAL=1
    command: >
      lm_eval --model local-completions
      --model_args base_url=http://127.0.0.1:1234/v1/completions,num_concurrent=2,model=zai-org/GLM-4.7-Flash
      --tasks humaneval
      --confirm_run_unsafe_code
      --log_samples
      --output_path /app/eval_results
    # --model_args base_url=http://127.0.0.1:1234/v1/completions,num_concurrent=2,model=openai/gpt-oss-20b
```
Execution Script

```bash
#!/bin/bash
docker compose down
docker compose up --build -d --remove-orphans
docker logs -f lm_eval_runner
```
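To pull the scores out afterwards: lm_eval drops timestamped `results_*.json` files under `--output_path`. A small helper sketch; the exact metric key has changed between lm_eval versions (e.g. `pass@1,create_test` in recent releases), so this matches on the `pass@1` prefix rather than hard-coding it:

```python
import json
from pathlib import Path

def extract_pass1(results: dict, task: str = "humaneval") -> float:
    """Pull the pass@1 score out of a parsed lm_eval results dict."""
    metrics = results["results"][task]
    for key, value in metrics.items():
        # keys look like "pass@1,create_test"; skip "pass@1_stderr,..." etc.
        if key.split(",")[0] == "pass@1":
            return float(value)
    raise KeyError(f"no pass@1 metric found for task {task!r}")

def latest_results_file(results_dir: str = "./eval_results") -> Path:
    """lm_eval writes timestamped results_*.json files; pick the newest."""
    files = sorted(Path(results_dir).rglob("results_*.json"))
    if not files:
        raise FileNotFoundError(f"no results_*.json under {results_dir}")
    return files[-1]

# usage after a run:
# results = json.loads(latest_results_file().read_text())
# print(f"pass@1 = {extract_pass1(results):.2%}")
```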
Has anyone else benchmarked the GLM-4.7-Flash GGUF models? I'm curious whether this performance is expected at these quantization levels or whether it's due to a misconfiguration on my end.