RuntimeError: split_with_sizes expects split_sizes have only non-negative entries, but got split_sizes=[64, -32]
#1
by nikita-savelyev-cerebras - opened
Hi! Thanks for preparing the model.
I observe the following error when running the sample script for inference with Transformers:
/venv/bin/python tmp.py
Loading weights: 100%|██████████| 30/30 [00:00<00:00, 282.85it/s, Materializing param=model.norm.weight]
GlmMoeDsaForCausalLM LOAD REPORT from: tiny-random/glm-5
Key | Status |
----------------------------------------------------------+------------+-
model.layers.2.mlp.gate.e_score_correction_bias | UNEXPECTED |
model.layers.{0, 1, 2}.self_attn.k_norm.weight | UNEXPECTED |
model.layers.2.self_attn.q_b_proj.weight | UNEXPECTED |
model.layers.2.self_attn.q_a_layernorm.weight | UNEXPECTED |
model.layers.2.self_attn.o_proj.weight | UNEXPECTED |
model.layers.2.shared_head.norm.weight | UNEXPECTED |
model.layers.{0, 1, 2}.self_attn.wq_b.weight | UNEXPECTED |
model.layers.2.mlp.shared_experts.gate_proj.weight | UNEXPECTED |
model.layers.{0, 1, 2}.self_attn.weights_proj.weight | UNEXPECTED |
model.layers.2.input_layernorm.weight | UNEXPECTED |
model.layers.2.self_attn.kv_a_proj_with_mqa.weight | UNEXPECTED |
model.layers.2.self_attn.kv_b_proj.weight | UNEXPECTED |
model.layers.2.enorm.weight | UNEXPECTED |
model.layers.{0, 1, 2}.self_attn.wk.weight | UNEXPECTED |
model.layers.2.mlp.experts.down_proj | UNEXPECTED |
model.layers.2.eh_proj.weight | UNEXPECTED |
model.layers.2.mlp.shared_experts.down_proj.weight | UNEXPECTED |
model.layers.2.mlp.gate.weight | UNEXPECTED |
model.layers.2.mlp.experts.gate_up_proj | UNEXPECTED |
model.layers.2.self_attn.q_a_proj.weight | UNEXPECTED |
model.layers.2.hnorm.weight | UNEXPECTED |
model.layers.2.mlp.shared_experts.up_proj.weight | UNEXPECTED |
model.layers.2.post_attention_layernorm.weight | UNEXPECTED |
model.layers.2.self_attn.kv_a_layernorm.weight | UNEXPECTED |
model.layers.{0, 1}.self_attn.indexer.wk.weight | MISSING |
model.layers.{0, 1}.self_attn.indexer.k_norm.bias | MISSING |
model.layers.{0, 1}.self_attn.indexer.weights_proj.weight | MISSING |
model.layers.{0, 1}.self_attn.indexer.k_norm.weight | MISSING |
model.layers.{0, 1}.self_attn.indexer.wq_b.weight | MISSING |
Notes:
- UNEXPECTED :can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING :those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Traceback (most recent call last):
File "tmp.py", line 14, in <module>
generated_ids = model.generate(input_ids, max_new_tokens=32)
File "/venv/lib/python3.13/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
return func(*args, **kwargs)
File "/venv/lib/python3.13/site-packages/transformers/generation/utils.py", line 2668, in generate
result = decoding_method(
self,
...<5 lines>...
**model_kwargs,
)
File "/venv/lib/python3.13/site-packages/transformers/generation/utils.py", line 2863, in _sample
outputs = self._prefill(input_ids, generation_config, model_kwargs)
File "/venv/lib/python3.13/site-packages/transformers/generation/utils.py", line 3857, in _prefill
return self(**model_inputs, return_dict=True)
File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
return forward_call(*args, **kwargs)
File "/venv/lib/python3.13/site-packages/transformers/utils/generic.py", line 841, in wrapper
output = func(self, *args, **kwargs)
File "/venv/lib/python3.13/site-packages/transformers/models/glm_moe_dsa/modeling_glm_moe_dsa.py", line 864, in forward
outputs: BaseModelOutputWithPast = self.model(
~~~~~~~~~~^
input_ids=input_ids,
^^^^^^^^^^^^^^^^^^^^
...<6 lines>...
**kwargs,
^^^^^^^^^
)
^
File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
return forward_call(*args, **kwargs)
File "/venv/lib/python3.13/site-packages/transformers/utils/generic.py", line 915, in wrapper
output = func(self, *args, **kwargs)
File "/venv/lib/python3.13/site-packages/transformers/utils/output_capturing.py", line 253, in wrapper
outputs = func(self, *args, **kwargs)
File "/venv/lib/python3.13/site-packages/transformers/models/glm_moe_dsa/modeling_glm_moe_dsa.py", line 799, in forward
hidden_states = decoder_layer(
hidden_states,
...<6 lines>...
**kwargs,
)
File "/venv/lib/python3.13/site-packages/transformers/modeling_layers.py", line 93, in __call__
return super().__call__(*args, **kwargs)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
return forward_call(*args, **kwargs)
File "/venv/lib/python3.13/site-packages/transformers/models/glm_moe_dsa/modeling_glm_moe_dsa.py", line 614, in forward
hidden_states, _ = self.self_attn(
~~~~~~~~~~~~~~^
hidden_states=hidden_states,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<6 lines>...
**kwargs,
^^^^^^^^^
)
^
File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
return forward_call(*args, **kwargs)
File "/venv/lib/python3.13/site-packages/transformers/models/glm_moe_dsa/modeling_glm_moe_dsa.py", line 398, in forward
topk_indices = self.indexer(
hidden_states,
...<3 lines>...
use_cache=past_key_values is not None,
) # [B, S, topk]
File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.13/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
return forward_call(*args, **kwargs)
File "/venv/lib/python3.13/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
return func(*args, **kwargs)
File "/venv/lib/python3.13/site-packages/transformers/models/glm_moe_dsa/modeling_glm_moe_dsa.py", line 181, in forward
q_pe, q_nope = torch.split(q, [self.qk_rope_head_dim, self.head_dim - self.qk_rope_head_dim], dim=-1)
~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.13/site-packages/torch/functional.py", line 173, in split
return tensor.split(split_size_or_sections, dim)
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.13/site-packages/torch/_tensor.py", line 1066, in split
return torch._VF.split_with_sizes(
~~~~~~~~~~~~~~~~~~~~~~~~~~^
self,
^^^^^
...<2 lines>...
dim,
^^^^
)
^
RuntimeError: split_with_sizes expects split_sizes have only non-negative entries, but got split_sizes=[64, -32]
Process finished with exit code 1
Environment:
accelerate==1.12.0
aiohappyeyeballs==2.6.1
aiohttp==3.13.3
aiosignal==1.4.0
annotated-doc==0.0.4
anyio==4.12.1
attrs==25.4.0
black==26.1.0
certifi==2026.1.4
charset-normalizer==3.4.4
click==8.3.1
cuda-bindings==12.9.4
cuda-pathfinder==1.3.5
datasets==4.5.0
dill==0.4.0
filelock==3.24.3
frozenlist==1.8.0
fsspec==2025.10.0
h11==0.16.0
hf-xet==1.3.0
httpcore==1.0.9
httpx==0.28.1
huggingface_hub==1.4.1
idna==3.11
iniconfig==2.3.0
Jinja2==3.1.6
markdown-it-py==4.0.0
MarkupSafe==3.0.3
mdurl==0.1.2
mpmath==1.3.0
multidict==6.7.1
multiprocess==0.70.18
mypy_extensions==1.1.0
networkx==3.6.1
ninja==1.13.0
numpy==2.2.6
nvidia-cublas-cu12==12.8.4.1
nvidia-cuda-cupti-cu12==12.8.90
nvidia-cuda-nvrtc-cu12==12.8.93
nvidia-cuda-runtime-cu12==12.8.90
nvidia-cudnn-cu12==9.10.2.21
nvidia-cufft-cu12==11.3.3.83
nvidia-cufile-cu12==1.13.1.3
nvidia-curand-cu12==10.3.9.90
nvidia-cusolver-cu12==11.7.3.90
nvidia-cusparse-cu12==12.5.8.93
nvidia-cusparselt-cu12==0.7.1
nvidia-nccl-cu12==2.27.5
nvidia-nvjitlink-cu12==12.8.93
nvidia-nvshmem-cu12==3.4.5
nvidia-nvtx-cu12==12.8.90
packaging==26.0
pandas==3.0.1
pathspec==1.0.4
platformdirs==4.9.2
pluggy==1.6.0
propcache==0.4.1
psutil==7.2.2
pyarrow==23.0.1
Pygments==2.19.2
pytest==9.0.2
python-dateutil==2.9.0.post0
pytokens==0.4.1
PyYAML==6.0.3
regex==2026.2.19
requests==2.32.5
rich==14.3.3
ruff==0.14.14
safetensors==0.7.0
setuptools==82.0.0
shellingham==1.5.4
six==1.17.0
sympy==1.14.0
tokenizers==0.22.2
torch==2.10.0
tqdm==4.67.3
transformers==5.2.0
triton==3.6.0
typer==0.24.1
typer-slim==0.24.0
typing_extensions==4.15.0
urllib3==2.6.3
xxhash==3.6.0
yarl==1.22.0
zstandard==0.25.0
Perhaps something is wrong with the config?
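The split sizes in the error message narrow this down. The failing call is `torch.split(q, [self.qk_rope_head_dim, self.head_dim - self.qk_rope_head_dim], dim=-1)`, and it received `[64, -32]`, so inside the indexer `qk_rope_head_dim` is 64 while `head_dim` is 32. Below is a minimal sketch of that arithmetic in plain Python, with no torch dependency. `indexer_split_sizes` and `check_split_sizes` are hypothetical helper names, not Transformers functions, and the traceback does not show which config fields feed the indexer's `head_dim`, so that mapping is left open.

```python
def indexer_split_sizes(head_dim: int, qk_rope_head_dim: int) -> list[int]:
    """Build the same size list that the traceback shows being passed to torch.split."""
    return [qk_rope_head_dim, head_dim - qk_rope_head_dim]


def check_split_sizes(head_dim: int, qk_rope_head_dim: int) -> None:
    """Raise early if the rotary part is larger than the whole head dimension."""
    sizes = indexer_split_sizes(head_dim, qk_rope_head_dim)
    if any(size < 0 for size in sizes):
        raise ValueError(
            f"head_dim={head_dim} is smaller than "
            f"qk_rope_head_dim={qk_rope_head_dim}; split sizes would be {sizes}"
        )


# Values inferred from the error message: 64 and 32 - 64 = -32.
print(indexer_split_sizes(head_dim=32, qk_rope_head_dim=64))  # → [64, -32]
```

Any fix presumably has to make the indexer's head dimension at least as large as the rotary dimension, since the non-rotary slice cannot have negative width.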
Hi Nikita, thanks for your investigation! I've updated the models.
yujiepan changed discussion status to closed