How to use from the
Use from the
llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="cafonez/GLM-4.7-Flash-REAP-23B-A3B-ROCmFP4-GGUF",
	filename="GLM-4.7-Flash-REAP-23B-A3B-BF16-to-ROCmFP4-STRIX_LEAN.gguf",
)
llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

GLM 4.7 Flash REAP 23B-A3B ROCmFP4 GGUF

Unofficial GGUF quantization of GLM 4.7 Flash REAP 23B-A3B for local llama.cpp testing on AMD/Strix systems.

File

File Size
GLM-4.7-Flash-REAP-23B-A3B-BF16-to-ROCmFP4-STRIX_LEAN.gguf 12,292,302,592 bytes, about 11.45 GiB

Filesystem display: about 12G.

Quantization

  • Source precision: BF16 GGUF input
  • Output format: Q4_0_ROCMFP4
  • Variant: STRIX_LEAN
  • Runtime target: llama.cpp ROCmFP4 build with Vulkan or HIP/ROCm backend

Local Test Results

Quick local results on the maintainer's Strix setup:

Backend / mode Result
Vulkan, tg8 ~73.28 tok/s
Vulkan, tg128 ~50.16 tok/s
Vulkan, tg256 ~48.30 tok/s
HIP + GGML_HIP_ENABLE_UNIFIED_MEMORY=1, tg8 ~59.39 tok/s
HIP + GGML_HIP_ENABLE_UNIFIED_MEMORY=1, tg128 ~41.46 tok/s
HIP + GGML_HIP_ENABLE_UNIFIED_MEMORY=1, tg256 ~40.09 tok/s
Vulkan Wikitext-2 quick PPL, ctx 2048, chunks 8 ~15.4046 +/- 0.54576
HIP Wikitext-2 quick PPL, ctx 2048, chunks 8 ~14.7812 +/- 0.51969

Vulkan was preferred for interactive chat speed in local testing. HIP required unified memory on this setup.

llama.cpp Chat Example

Continuous Vulkan chat:

llama-cli \
  -m GLM-4.7-Flash-REAP-23B-A3B-BF16-to-ROCmFP4-STRIX_LEAN.gguf \
  -cnv \
  -ngl 99 \
  -fa on \
  --no-mmap \
  --jinja \
  --reasoning off \
  -c 32768 \
  -b 512 \
  -ub 512 \
  -ctk q4_0 \
  -ctv q4_0

Do not use -mli for normal Enter-to-send chat unless you intentionally want multiline input behavior.

HIP/ROCm example:

GGML_HIP_ENABLE_UNIFIED_MEMORY=1 llama-cli \
  -m GLM-4.7-Flash-REAP-23B-A3B-BF16-to-ROCmFP4-STRIX_LEAN.gguf \
  -cnv \
  -ngl 99 \
  -fa on \
  --no-mmap \
  --jinja \
  --reasoning off \
  -c 32768 \
  -b 512 \
  -ub 512 \
  -ctk q4_0 \
  -ctv q4_0

For coherence checks, compare -ctk q8_0 -ctv q8_0 against q4_0.

Notes

This is an experimental local-inference quantization. Confirm that your use complies with the upstream GLM model license and any applicable terms.

Downloads last month
281
GGUF
Model size
23B params
Architecture
deepseek2
Hardware compatibility
Log In to add your hardware

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for cafonez/GLM-4.7-Flash-REAP-23B-A3B-ROCmFP4-GGUF

Quantized
(81)
this model