REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression
Paper • Code • Blog
GLM-4.7-REAP-265B-mixed-Autoround
Note:
This is the second version that I'm sharing here.
It has proven to be much better at coding, no longer falls into repetition loops, and has no problems with math or tool calling.
Summary
A 26% expert-pruned GLM-4.7 optimized for coding, function calling, and agentic workflows. This version works well with Roo Code.
Created using REAP (Router-weighted Expert Activation Pruning) by Cerebras:
- 265B: 26% of MoE experts pruned (119 of the original 160 remain per layer); fits on 2x RTX 6000 Pro with 110k tokens of fp16 context, or 230k with fp8_e5m2
- Calibrated for coding on the dataset agent_calibration_mix_v6.jsonl
- Works with vLLM
Model Specifications
| Property | Value |
|---|---|
| Base Model | zai/glm-4.7 |
| Architecture | Sparse Mixture-of-Experts (SMoE) |
| Original Parameters | 358B |
| Pruned Parameters | 265B |
| Compression | 26% experts removed |
| Experts per Layer | 119 (was 160) |
| MoE Layers | 92 |
| Activated Experts | 8 per token |
| Precision | AutoRound 4/8-bit, with the most important layers kept at original precision |
| Disk Size | ~140GB |
| VRAM Required | 184.5 GB with 110k tokens of fp16 context; 185 GB with 230k tokens of fp8_e5m2 context (the model's full context is 202,752 tokens) |
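As a rough sanity check on the context-memory numbers above: KV-cache size scales linearly with context length and with bytes per element, which is why fp8_e5m2 roughly doubles the usable context in the same VRAM budget. A minimal sketch of that arithmetic (the layer count, KV-head count, and head dimension below are illustrative placeholders, not the actual GLM-4.7 attention geometry):

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """Approximate KV-cache size; the factor 2 is for the separate
    K and V tensors stored per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

# Illustrative placeholder geometry (NOT the real GLM-4.7 config).
LAYERS, KV_HEADS, HEAD_DIM = 92, 8, 128

fp16_110k = kv_cache_bytes(110_000, LAYERS, KV_HEADS, HEAD_DIM, 2)  # fp16 = 2 bytes
fp8_230k = kv_cache_bytes(230_000, LAYERS, KV_HEADS, HEAD_DIM, 1)   # fp8  = 1 byte

print(f"fp16 @ 110k tokens: {fp16_110k / 2**30:.1f} GiB")
print(f"fp8  @ 230k tokens: {fp8_230k / 2**30:.1f} GiB")
```

The key relationship is that fp8 halves the per-token cost, so ~2x the context fits in the same cache budget.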
Calibration Dataset
The quality of a REAP-pruned model depends strongly on the calibration dataset used to identify the most and least activated experts.
This detection procedure is called "observations" and is quite computationally expensive, especially when hardware is limited. I have less VRAM than the original model size, so it had to be offloaded to RAM and even partially to NVMe storage. The original observations procedure for 1,024 samples took more than 100 hours on my hardware, so I modified the original observer to test all experts on a per-layer basis with flash attention enabled. This significantly lowered the VRAM requirements (I was able to complete it with ~24GB of VRAM) and reduced the time needed to ~25 hours.
Once the observations file is generated, it can be reused to perform multiple REAP operations (e.g., 10%, 30%, or any other ratio you want).
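Conceptually, REAP (Router-weighted Expert Activation Pruning) scores each expert by its router-gate-weighted output magnitude over the calibration tokens, then drops the lowest-scoring experts. A toy sketch of that ranking step on hypothetical observation data (the real observer collects these statistics per layer while running the model; this is not the actual Cerebras implementation):

```python
def reap_scores(observations: dict) -> dict:
    """observations: per expert id, a list of (gate_weight, output_norm)
    pairs collected over calibration tokens routed to that expert.
    Saliency = mean router-weighted activation magnitude."""
    return {
        expert: (sum(g * n for g, n in obs) / len(obs) if obs else 0.0)
        for expert, obs in observations.items()
    }

def experts_to_prune(observations: dict, prune_ratio: float) -> list:
    """Return the ids of the lowest-saliency experts to remove."""
    scores = reap_scores(observations)
    k = int(len(scores) * prune_ratio)
    return sorted(scores, key=scores.get)[:k]

# Toy example: 4 experts, prune 25% -> drop the weakest one.
obs = {
    0: [(0.9, 2.0), (0.8, 1.5)],  # strong, frequently used
    1: [(0.1, 0.2)],              # rarely and weakly activated
    2: [(0.5, 1.0), (0.6, 1.2)],
    3: [(0.7, 0.9)],
}
print(experts_to_prune(obs, 0.25))  # -> [1]
```

Because the scores are computed once from the saved observations, re-running the ranking at a different `prune_ratio` (10%, 26%, 30%, ...) is cheap, which is exactly why the observations file can be reused.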
I used a mix of datasets to select the 1,024 calibration samples from a pool of 8,192:
- m-a-p/CodeFeedback-Filtered-Instruction: 10.0%
- ise-uiuc/Magicoder-Evol-Instruct-110K: 15.0%
- allenai/tulu-3-sft-mixture: 10.0%
- TeichAI/claude-4.5-opus-high-reasoning-250x: 10.0%
- theblackcat102/evol-codealpaca-v1: 10.0%
- cais/mmlu (abstract_algebra): 0.4%
- cais/mmlu (college_computer_science): 0.4%
- cais/mmlu (college_mathematics): 0.4%
- cais/mmlu (college_physics): 0.4%
- cais/mmlu (computer_security): 0.4%
- cais/mmlu (conceptual_physics): 0.4%
- cais/mmlu (elementary_mathematics): 0.4%
- cais/mmlu (formal_logic): 0.4%
- cais/mmlu (high_school_computer_science): 0.4%
- cais/mmlu (high_school_mathematics): 0.4%
- cais/mmlu (high_school_physics): 0.4%
- cais/mmlu (machine_learning): 0.4%
- allenai/tulu-3-sft-personas-math: 10.0%
- Salesforce/xlam-function-calling-60k: 12.0%
- glaiveai/glaive-function-calling-v2: 8.0%
- euclaise/WritingPrompts_curated: 10.0%
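The percentages above translate into per-source sample counts for the 1,024-sample mix. A small sketch of that bookkeeping (source names abbreviated; naive rounding, so the counts land slightly under 1,024 and would need a small adjustment in practice):

```python
# Per-source weights from the list above; the 12 MMLU subjects
# are collapsed into one entry at 12 * 0.4%.
MIX = {
    "CodeFeedback": 0.10, "Magicoder": 0.15, "tulu-3-sft-mixture": 0.10,
    "claude-4.5-opus": 0.10, "evol-codealpaca": 0.10,
    "mmlu (12 subjects)": 12 * 0.004,
    "tulu-3-personas-math": 0.10, "xlam-function-calling": 0.12,
    "glaive-function-calling": 0.08, "WritingPrompts": 0.10,
}

TOTAL = 1024
counts = {name: round(w * TOTAL) for name, w in MIX.items()}

for name, n in counts.items():
    print(f"{name:28s} {n:4d}")
print("sum of weights:", round(sum(MIX.values()), 3))
```

Roughly two-thirds of the mix is code and function-calling data, which matches the model's coding/agentic focus.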
Resulting Dataset
AImhotep/agent_calibration_mix_v6
Running in vLLM (example)
Example of glm47/generation_config.json:

```json
{
  "_from_model_config": true,
  "do_sample": true,
  "pad_token_id": 151329,
  "eos_token_id": [151329, 151336, 151338],
  "top_p": 0.95,
  "temperature": 0.7,
  "repetition_penalty": 1.05,
  "presence_penalty": 1.5,
  "transformers_version": "4.57.3",
  "top_k": 40,
  "min_p": 0.01
}
```
```bash
#!/bin/bash
#
# I'm using vLLM 0.15 - latest as of 2026-01-29
#
cd /root/vllm
source .venv/bin/activate
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export TORCH_ALLOW_TF32=1
export PYTORCH_CUDA_ALLOC_CONF=""
export TRANSFORMERS_VERBOSITY=info
export VLLM_ATTENTION_BACKEND="FLASHINFER"
export TORCH_CUDA_ARCH_LIST="12.0"
export CUDA_VISIBLE_DEVICES=1,2
export VLLM_MARLIN_USE_ATOMIC_ADD=1
export SAFETENSORS_FAST_GPU=1
export OMP_NUM_THREADS=62
# export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_FLASHINFER_MOE_BACKEND=latency
# without this, vLLM keeps a couple of cores at 100% when idle - wasted power
export VLLM_SLEEP_WHEN_IDLE=1
export NCCL_ALGO=Ring
export NCCL_PROTO=Simple
export NCCL_MIN_NCHANNELS=4
export NCCL_MAX_NCHANNELS=8
export NCCL_BUFFSIZE=8388608
vllm serve --model AImhotep/GLM-4.7-REAP-265B-mixed-AutoRound \
    --served-model-name "GLM-4.7-REAP-26" \
    --tensor-parallel-size 2 \
    --uvicorn-log-level info \
    --trust-remote-code \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 1 \
    --seed 42 \
    --max-model-len auto \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice \
    --enable-sleep-mode \
    --enable-expert-parallel \
    --generation-config /root/vllm/scripts/glm-47 \
    --compilation-config '{"level": 3, "cudagraph_capture_sizes": [1]}' \
    --allow-deprecated-quantization \
    --host 0.0.0.0 \
    --port 11110
```
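Once the server is up, it exposes the standard vLLM OpenAI-compatible API on the configured host and port. A minimal stdlib-only Python client sketch (the `model` field must match `--served-model-name` exactly; the URL assumes the launch script above is running locally):

```python
import json
import urllib.request

# Endpoint from the launch script above (--host 0.0.0.0 --port 11110).
URL = "http://localhost:11110/v1/chat/completions"

# "model" must match --served-model-name exactly.
payload = {
    "model": "GLM-4.7-REAP-26",
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 512,
}

def build_request(url: str, body: dict) -> urllib.request.Request:
    """Build the POST request; sending it requires the server to be up."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    with urllib.request.urlopen(build_request(URL, payload)) as resp:
        reply = json.loads(resp.read())
        print(reply["choices"][0]["message"]["content"])
```

Sampling parameters omitted here are filled in from the generation_config passed via `--generation-config`.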
Citation
```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025},
  url={https://arxiv.org/abs/2510.13999}
}
```