Instructions to use tiny-random/glm-5 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use tiny-random/glm-5 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="tiny-random/glm-5")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tiny-random/glm-5")
model = AutoModelForCausalLM.from_pretrained("tiny-random/glm-5")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use tiny-random/glm-5 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "tiny-random/glm-5"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tiny-random/glm-5",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/tiny-random/glm-5

SGLang

How to use tiny-random/glm-5 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "tiny-random/glm-5" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tiny-random/glm-5",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "tiny-random/glm-5" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "tiny-random/glm-5",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use tiny-random/glm-5 with Docker Model Runner:
```
docker model run hf.co/tiny-random/glm-5
```

RuntimeError: split_with_sizes expects split_sizes have only non-negative entries, but got split_sizes=[64, -32]

by nikita-savelyev-cerebras - opened Feb 26

Discussion

nikita-savelyev-cerebras

Feb 26

Hi! Thanks for preparing the model.

I observe the following error when running the sample script for inference with Transformers:

/venv/bin/python tmp.py 
Loading weights: 100%|██████████| 30/30 [00:00<00:00, 282.85it/s, Materializing param=model.norm.weight]
GlmMoeDsaForCausalLM LOAD REPORT from: tiny-random/glm-5
Key                                                       | Status     | 
----------------------------------------------------------+------------+-
model.layers.2.mlp.gate.e_score_correction_bias           | UNEXPECTED | 
model.layers.{0, 1, 2}.self_attn.k_norm.weight            | UNEXPECTED | 
model.layers.2.self_attn.q_b_proj.weight                  | UNEXPECTED | 
model.layers.2.self_attn.q_a_layernorm.weight             | UNEXPECTED | 
model.layers.2.self_attn.o_proj.weight                    | UNEXPECTED | 
model.layers.2.shared_head.norm.weight                    | UNEXPECTED | 
model.layers.{0, 1, 2}.self_attn.wq_b.weight              | UNEXPECTED | 
model.layers.2.mlp.shared_experts.gate_proj.weight        | UNEXPECTED | 
model.layers.{0, 1, 2}.self_attn.weights_proj.weight      | UNEXPECTED | 
model.layers.2.input_layernorm.weight                     | UNEXPECTED | 
model.layers.2.self_attn.kv_a_proj_with_mqa.weight        | UNEXPECTED | 
model.layers.2.self_attn.kv_b_proj.weight                 | UNEXPECTED | 
model.layers.2.enorm.weight                               | UNEXPECTED | 
model.layers.{0, 1, 2}.self_attn.wk.weight                | UNEXPECTED | 
model.layers.2.mlp.experts.down_proj                      | UNEXPECTED | 
model.layers.2.eh_proj.weight                             | UNEXPECTED | 
model.layers.2.mlp.shared_experts.down_proj.weight        | UNEXPECTED | 
model.layers.2.mlp.gate.weight                            | UNEXPECTED | 
model.layers.2.mlp.experts.gate_up_proj                   | UNEXPECTED | 
model.layers.2.self_attn.q_a_proj.weight                  | UNEXPECTED | 
model.layers.2.hnorm.weight                               | UNEXPECTED | 
model.layers.2.mlp.shared_experts.up_proj.weight          | UNEXPECTED | 
model.layers.2.post_attention_layernorm.weight            | UNEXPECTED | 
model.layers.2.self_attn.kv_a_layernorm.weight            | UNEXPECTED | 
model.layers.{0, 1}.self_attn.indexer.wk.weight           | MISSING    | 
model.layers.{0, 1}.self_attn.indexer.k_norm.bias         | MISSING    | 
model.layers.{0, 1}.self_attn.indexer.weights_proj.weight | MISSING    | 
model.layers.{0, 1}.self_attn.indexer.k_norm.weight       | MISSING    | 
model.layers.{0, 1}.self_attn.indexer.wq_b.weight         | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Traceback (most recent call last):
  File "tmp.py", line 14, in <module>
    generated_ids = model.generate(input_ids, max_new_tokens=32)
  File "/venv/lib/python3.13/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
  File "/venv/lib/python3.13/site-packages/transformers/generation/utils.py", line 2668, in generate
    result = decoding_method(
        self,
    ...<5 lines>...
        **model_kwargs,
    )
  File "/venv/lib/python3.13/site-packages/transformers/generation/utils.py", line 2863, in _sample
    outputs = self._prefill(input_ids, generation_config, model_kwargs)
  File "/venv/lib/python3.13/site-packages/transformers/generation/utils.py", line 3857, in _prefill
    return self(**model_inputs, return_dict=True)
  File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.13/site-packages/transformers/utils/generic.py", line 841, in wrapper
    output = func(self, *args, **kwargs)
  File "/venv/lib/python3.13/site-packages/transformers/models/glm_moe_dsa/modeling_glm_moe_dsa.py", line 864, in forward
    outputs: BaseModelOutputWithPast = self.model(
                                       ~~~~~~~~~~^
        input_ids=input_ids,
        ^^^^^^^^^^^^^^^^^^^^
    ...<6 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.13/site-packages/transformers/utils/generic.py", line 915, in wrapper
    output = func(self, *args, **kwargs)
  File "/venv/lib/python3.13/site-packages/transformers/utils/output_capturing.py", line 253, in wrapper
    outputs = func(self, *args, **kwargs)
  File "/venv/lib/python3.13/site-packages/transformers/models/glm_moe_dsa/modeling_glm_moe_dsa.py", line 799, in forward
    hidden_states = decoder_layer(
        hidden_states,
    ...<6 lines>...
        **kwargs,
    )
  File "/venv/lib/python3.13/site-packages/transformers/modeling_layers.py", line 93, in __call__
    return super().__call__(*args, **kwargs)
           ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.13/site-packages/transformers/models/glm_moe_dsa/modeling_glm_moe_dsa.py", line 614, in forward
    hidden_states, _ = self.self_attn(
                       ~~~~~~~~~~~~~~^
        hidden_states=hidden_states,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<6 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.13/site-packages/transformers/models/glm_moe_dsa/modeling_glm_moe_dsa.py", line 398, in forward
    topk_indices = self.indexer(
        hidden_states,
    ...<3 lines>...
        use_cache=past_key_values is not None,
    )  # [B, S, topk]
  File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.13/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
  File "/venv/lib/python3.13/site-packages/transformers/models/glm_moe_dsa/modeling_glm_moe_dsa.py", line 181, in forward
    q_pe, q_nope = torch.split(q, [self.qk_rope_head_dim, self.head_dim - self.qk_rope_head_dim], dim=-1)
                   ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.13/site-packages/torch/functional.py", line 173, in split
    return tensor.split(split_size_or_sections, dim)
           ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.13/site-packages/torch/_tensor.py", line 1066, in split
    return torch._VF.split_with_sizes(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~^
        self,
        ^^^^^
    ...<2 lines>...
        dim,
        ^^^^
    )
    ^
RuntimeError: split_with_sizes expects split_sizes have only non-negative entries, but got split_sizes=[64, -32]

Process finished with exit code 1
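
Reading the load report above, several UNEXPECTED keys differ from MISSING keys only by an `indexer.` path segment (for example `self_attn.wk.weight` versus `self_attn.indexer.wk.weight`). The sketch below pairs them up for layer 0. The helper name `pair_renamed_keys` and the hand-expanded key lists are illustrative assumptions, and the pairing is only one reading of the report, not a confirmed diagnosis.

```python
def pair_renamed_keys(unexpected, missing, segment="indexer"):
    """Pair each missing key with an unexpected key that differs only by `segment`."""
    unexpected_set = set(unexpected)
    pairs = {}
    for key in missing:
        parts = key.split(".")
        if segment in parts:
            parts.remove(segment)  # drop the first occurrence of the segment
            candidate = ".".join(parts)
            if candidate in unexpected_set:
                pairs[key] = candidate
    return pairs


# Layer-0 keys expanded by hand from the load report (assumption: the
# "{0, 1, 2}" and "{0, 1}" groups expand to one key per listed layer).
unexpected = [
    "model.layers.0.self_attn.k_norm.weight",
    "model.layers.0.self_attn.wq_b.weight",
    "model.layers.0.self_attn.weights_proj.weight",
    "model.layers.0.self_attn.wk.weight",
]
missing = [
    "model.layers.0.self_attn.indexer.wk.weight",
    "model.layers.0.self_attn.indexer.k_norm.bias",
    "model.layers.0.self_attn.indexer.weights_proj.weight",
    "model.layers.0.self_attn.indexer.k_norm.weight",
    "model.layers.0.self_attn.indexer.wq_b.weight",
]

pairs = pair_renamed_keys(unexpected, missing)
for missing_key, unexpected_key in sorted(pairs.items()):
    print(f"{missing_key}  <-  {unexpected_key}")

# Missing keys with no counterpart among the unexpected ones.
unpaired = sorted(set(missing) - set(pairs))
print("unpaired:", unpaired)
```

With these inputs, four of the five missing keys have a counterpart among the unexpected ones; `indexer.k_norm.bias` has none in the report.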

Environment:

accelerate==1.12.0
aiohappyeyeballs==2.6.1
aiohttp==3.13.3
aiosignal==1.4.0
annotated-doc==0.0.4
anyio==4.12.1
attrs==25.4.0
black==26.1.0
certifi==2026.1.4
charset-normalizer==3.4.4
click==8.3.1
cuda-bindings==12.9.4
cuda-pathfinder==1.3.5
datasets==4.5.0
dill==0.4.0
filelock==3.24.3
frozenlist==1.8.0
fsspec==2025.10.0
h11==0.16.0
hf-xet==1.3.0
httpcore==1.0.9
httpx==0.28.1
huggingface_hub==1.4.1
idna==3.11
iniconfig==2.3.0
Jinja2==3.1.6
markdown-it-py==4.0.0
MarkupSafe==3.0.3
mdurl==0.1.2
mpmath==1.3.0
multidict==6.7.1
multiprocess==0.70.18
mypy_extensions==1.1.0
networkx==3.6.1
ninja==1.13.0
numpy==2.2.6
nvidia-cublas-cu12==12.8.4.1
nvidia-cuda-cupti-cu12==12.8.90
nvidia-cuda-nvrtc-cu12==12.8.93
nvidia-cuda-runtime-cu12==12.8.90
nvidia-cudnn-cu12==9.10.2.21
nvidia-cufft-cu12==11.3.3.83
nvidia-cufile-cu12==1.13.1.3
nvidia-curand-cu12==10.3.9.90
nvidia-cusolver-cu12==11.7.3.90
nvidia-cusparse-cu12==12.5.8.93
nvidia-cusparselt-cu12==0.7.1
nvidia-nccl-cu12==2.27.5
nvidia-nvjitlink-cu12==12.8.93
nvidia-nvshmem-cu12==3.4.5
nvidia-nvtx-cu12==12.8.90
packaging==26.0
pandas==3.0.1
pathspec==1.0.4
platformdirs==4.9.2
pluggy==1.6.0
propcache==0.4.1
psutil==7.2.2
pyarrow==23.0.1
Pygments==2.19.2
pytest==9.0.2
python-dateutil==2.9.0.post0
pytokens==0.4.1
PyYAML==6.0.3
regex==2026.2.19
requests==2.32.5
rich==14.3.3
ruff==0.14.14
safetensors==0.7.0
setuptools==82.0.0
shellingham==1.5.4
six==1.17.0
sympy==1.14.0
tokenizers==0.22.2
torch==2.10.0
tqdm==4.67.3
transformers==5.2.0
triton==3.6.0
typer==0.24.1
typer-slim==0.24.0
typing_extensions==4.15.0
urllib3==2.6.3
xxhash==3.6.0
yarl==1.22.0
zstandard==0.25.0

Perhaps something is wrong with the config?
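
The negative entry can be traced back from the failing line in the traceback, which splits `q` into `[self.qk_rope_head_dim, self.head_dim - self.qk_rope_head_dim]`. With `split_sizes=[64, -32]`, that implies `qk_rope_head_dim = 64` and `head_dim = 32` inside the indexer. A minimal sketch of that arithmetic follows. The function names are hypothetical, the value 128 is only an example of a head dimension large enough for the split, and the traceback does not show which config fields feed these two attributes.

```python
def split_sizes(head_dim, qk_rope_head_dim):
    """Sizes used by the failing torch.split call: [rope part, non-rope part]."""
    return [qk_rope_head_dim, head_dim - qk_rope_head_dim]


def check_split(head_dim, qk_rope_head_dim):
    """Return the split sizes, or raise if any of them would be negative."""
    sizes = split_sizes(head_dim, qk_rope_head_dim)
    if min(sizes) < 0:
        raise ValueError(
            f"head_dim={head_dim} is smaller than "
            f"qk_rope_head_dim={qk_rope_head_dim}: split sizes {sizes}"
        )
    return sizes


# Values inferred from the error message: 64 + (-32) = 32.
reported = split_sizes(head_dim=32, qk_rope_head_dim=64)
print(reported)

# Hypothetical value: any head_dim >= qk_rope_head_dim keeps both sizes non-negative.
print(check_split(head_dim=128, qk_rope_head_dim=64))
```

Under this reading, the split only works when the indexer's head dimension is at least as large as the rope dimension.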

nikita-savelyev-cerebras

Feb 26

Should be fixed by https://huggingface.co/tiny-random/glm-5/discussions/2

yujiepan

tiny-random org Feb 27

Hi Nikita, thanks for your investigation! I've updated the models.

yujiepan changed discussion status to closed Feb 27
