Instructions to use zai-org/GLM-4.7-Flash with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use zai-org/GLM-4.7-Flash with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="zai-org/GLM-4.7-Flash")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")
model = AutoModelForCausalLM.from_pretrained("zai-org/GLM-4.7-Flash")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Inference
HuggingChat
Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use zai-org/GLM-4.7-Flash with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "zai-org/GLM-4.7-Flash"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.7-Flash",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/zai-org/GLM-4.7-Flash

SGLang

How to use zai-org/GLM-4.7-Flash with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "zai-org/GLM-4.7-Flash" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.7-Flash",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "zai-org/GLM-4.7-Flash" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "zai-org/GLM-4.7-Flash",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use zai-org/GLM-4.7-Flash with Docker Model Runner:
```
docker model run hf.co/zai-org/GLM-4.7-Flash
```

Problem with model

#22

by dwojcik - opened Jan 20

Discussion

dwojcik

Jan 20

I have problem running the model in container. It seems that transformers library is not supporting it? What I am doing wrong?
The main error is:

ValueError: There is no module or parameter named 'model.layers.1.mlp.gate.e_score_correction_bias' in TransformersMoEForCausalLM
(EngineCore_DP0 pid=814) Traceback (most recent call last):

Full log:

Looking in indexes: https://pypi.org/simple, https://wheels.vllm.ai/nightly
Requirement already satisfied: vllm in /usr/local/lib/python3.12/dist-packages (0.13.0)
Collecting vllm
  Downloading vllm-0.14.0-cp38-abi3-manylinux_2_31_x86_64.whl.metadata (8.9 kB)
Requirement already satisfied: regex in /usr/local/lib/python3.12/dist-packages (from vllm) (2025.11.3)
Requirement already satisfied: cachetools in /usr/local/lib/python3.12/dist-packages (from vllm) (6.2.4)
Requirement already satisfied: psutil in /usr/local/lib/python3.12/dist-packages (from vllm) (7.1.3)
Requirement already satisfied: sentencepiece in /usr/local/lib/python3.12/dist-packages (from vllm) (0.2.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (from vllm) (2.2.6)
Requirement already satisfied: requests>=2.26.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (2.32.5)
Requirement already satisfied: tqdm in /usr/local/lib/python3.12/dist-packages (from vllm) (4.67.1)
Requirement already satisfied: blake3 in /usr/local/lib/python3.12/dist-packages (from vllm) (1.0.8)
Requirement already satisfied: py-cpuinfo in /usr/local/lib/python3.12/dist-packages (from vllm) (9.0.0)
Requirement already satisfied: transformers<5,>=4.56.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (4.57.3)
Requirement already satisfied: tokenizers>=0.21.1 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.22.1)
Requirement already satisfied: protobuf>=6.30.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (6.33.2)
Requirement already satisfied: fastapi>=0.115.0 in /usr/local/lib/python3.12/dist-packages (from fastapi[standard]>=0.115.0->vllm) (0.125.0)
Requirement already satisfied: aiohttp in /usr/local/lib/python3.12/dist-packages (from vllm) (3.13.2)
Requirement already satisfied: openai>=1.99.1 in /usr/local/lib/python3.12/dist-packages (from vllm) (2.13.0)
Requirement already satisfied: pydantic>=2.12.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (2.12.5)
Requirement already satisfied: prometheus_client>=0.18.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.23.1)
Requirement already satisfied: pillow in /usr/local/lib/python3.12/dist-packages (from vllm) (12.0.0)
Requirement already satisfied: prometheus-fastapi-instrumentator>=7.0.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (7.1.0)
Requirement already satisfied: tiktoken>=0.6.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.12.0)
Requirement already satisfied: lm-format-enforcer==0.11.3 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.11.3)
Requirement already satisfied: llguidance<1.4.0,>=1.3.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (1.3.0)
Requirement already satisfied: outlines_core==0.2.11 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.2.11)
Requirement already satisfied: diskcache==5.6.3 in /usr/local/lib/python3.12/dist-packages (from vllm) (5.6.3)
Requirement already satisfied: lark==1.2.2 in /usr/local/lib/python3.12/dist-packages (from vllm) (1.2.2)
Collecting xgrammar==0.1.29 (from vllm)
  Downloading xgrammar-0.1.29-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.8 kB)
Requirement already satisfied: typing_extensions>=4.10 in /usr/local/lib/python3.12/dist-packages (from vllm) (4.15.0)
Requirement already satisfied: filelock>=3.16.1 in /usr/local/lib/python3.12/dist-packages (from vllm) (3.20.1)
Requirement already satisfied: partial-json-parser in /usr/local/lib/python3.12/dist-packages (from vllm) (0.2.1.1.post7)
Requirement already satisfied: pyzmq>=25.0.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (27.1.0)
Requirement already satisfied: msgspec in /usr/local/lib/python3.12/dist-packages (from vllm) (0.20.0)
Requirement already satisfied: gguf>=0.17.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.17.1)
Collecting mistral_common>=1.8.8 (from mistral_common[image]>=1.8.8->vllm)
  Downloading mistral_common-1.8.8-py3-none-any.whl.metadata (5.3 kB)
Requirement already satisfied: opencv-python-headless>=4.11.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (4.12.0.88)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.12/dist-packages (from vllm) (6.0.3)
Requirement already satisfied: six>=1.16.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (1.17.0)
Requirement already satisfied: setuptools<81.0.0,>=77.0.3 in /usr/local/lib/python3.12/dist-packages (from vllm) (80.9.0)
Requirement already satisfied: einops in /usr/local/lib/python3.12/dist-packages (from vllm) (0.8.1)
Collecting compressed-tensors==0.13.0 (from vllm)
  Downloading compressed_tensors-0.13.0-py3-none-any.whl.metadata (7.0 kB)
Requirement already satisfied: depyf==0.20.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.20.0)
Requirement already satisfied: cloudpickle in /usr/local/lib/python3.12/dist-packages (from vllm) (3.1.2)
Requirement already satisfied: watchfiles in /usr/local/lib/python3.12/dist-packages (from vllm) (1.1.1)
Requirement already satisfied: python-json-logger in /usr/local/lib/python3.12/dist-packages (from vllm) (4.0.0)
Requirement already satisfied: ninja in /usr/local/lib/python3.12/dist-packages (from vllm) (1.13.0)
Requirement already satisfied: pybase64 in /usr/local/lib/python3.12/dist-packages (from vllm) (1.4.3)
Requirement already satisfied: cbor2 in /usr/local/lib/python3.12/dist-packages (from vllm) (5.7.1)
Requirement already satisfied: ijson in /usr/local/lib/python3.12/dist-packages (from vllm) (3.4.0.post0)
Requirement already satisfied: setproctitle in /usr/local/lib/python3.12/dist-packages (from vllm) (1.3.7)
Requirement already satisfied: openai-harmony>=0.0.3 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.0.8)
Requirement already satisfied: anthropic==0.71.0 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.71.0)
Requirement already satisfied: model-hosting-container-standards<1.0.0,>=0.1.10 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.1.12)
Requirement already satisfied: mcp in /usr/local/lib/python3.12/dist-packages (from vllm) (1.24.0)
Collecting grpcio>=1.76.0 (from vllm)
  Downloading grpcio-1.78.0rc2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.8 kB)
Collecting grpcio-reflection>=1.76.0 (from vllm)
  Downloading grpcio_reflection-1.78.0rc2-py3-none-any.whl.metadata (1.2 kB)
Requirement already satisfied: numba==0.61.2 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.61.2)
Requirement already satisfied: ray>=2.48.0 in /usr/local/lib/python3.12/dist-packages (from ray[cgraph]>=2.48.0->vllm) (2.52.1)
Collecting torch==2.9.1 (from vllm)
  Downloading torch-2.9.1-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (30 kB)
Collecting torchaudio==2.9.1 (from vllm)
  Downloading torchaudio-2.9.1-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (6.9 kB)
Collecting torchvision==0.24.1 (from vllm)
  Downloading torchvision-0.24.1-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (5.9 kB)
Requirement already satisfied: flashinfer-python==0.5.3 in /usr/local/lib/python3.12/dist-packages (from vllm) (0.5.3)
Requirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.12/dist-packages (from anthropic==0.71.0->vllm) (4.12.0)
Requirement already satisfied: distro<2,>=1.7.0 in /usr/local/lib/python3.12/dist-packages (from anthropic==0.71.0->vllm) (1.9.0)
Requirement already satisfied: docstring-parser<1,>=0.15 in /usr/local/lib/python3.12/dist-packages (from anthropic==0.71.0->vllm) (0.17.0)
Requirement already satisfied: httpx<1,>=0.25.0 in /usr/local/lib/python3.12/dist-packages (from anthropic==0.71.0->vllm) (0.28.1)
Requirement already satisfied: jiter<1,>=0.4.0 in /usr/local/lib/python3.12/dist-packages (from anthropic==0.71.0->vllm) (0.12.0)
Requirement already satisfied: sniffio in /usr/local/lib/python3.12/dist-packages (from anthropic==0.71.0->vllm) (1.3.1)
Requirement already satisfied: loguru in /usr/local/lib/python3.12/dist-packages (from compressed-tensors==0.13.0->vllm) (0.7.3)
Requirement already satisfied: astor in /usr/local/lib/python3.12/dist-packages (from depyf==0.20.0->vllm) (0.8.1)
Requirement already satisfied: dill in /usr/local/lib/python3.12/dist-packages (from depyf==0.20.0->vllm) (0.4.0)
Requirement already satisfied: apache-tvm-ffi<0.2,>=0.1 in /usr/local/lib/python3.12/dist-packages (from flashinfer-python==0.5.3->vllm) (0.1.6)
Requirement already satisfied: click in /usr/local/lib/python3.12/dist-packages (from flashinfer-python==0.5.3->vllm) (8.2.1)
Requirement already satisfied: nvidia-cudnn-frontend>=1.13.0 in /usr/local/lib/python3.12/dist-packages (from flashinfer-python==0.5.3->vllm) (1.16.0)
Requirement already satisfied: nvidia-cutlass-dsl>=4.2.1 in /usr/local/lib/python3.12/dist-packages (from flashinfer-python==0.5.3->vllm) (4.3.3)
Requirement already satisfied: nvidia-ml-py in /usr/local/lib/python3.12/dist-packages (from flashinfer-python==0.5.3->vllm) (13.590.44)
Requirement already satisfied: packaging>=24.2 in /usr/local/lib/python3.12/dist-packages (from flashinfer-python==0.5.3->vllm) (25.0)
Requirement already satisfied: tabulate in /usr/local/lib/python3.12/dist-packages (from flashinfer-python==0.5.3->vllm) (0.9.0)
Requirement already satisfied: interegular>=0.3.2 in /usr/local/lib/python3.12/dist-packages (from lm-format-enforcer==0.11.3->vllm) (0.3.3)
Requirement already satisfied: llvmlite<0.45,>=0.44.0dev0 in /usr/local/lib/python3.12/dist-packages (from numba==0.61.2->vllm) (0.44.0)
Requirement already satisfied: sympy>=1.13.3 in /usr/local/lib/python3.12/dist-packages (from torch==2.9.1->vllm) (1.14.0)
Requirement already satisfied: networkx>=2.5.1 in /usr/local/lib/python3.12/dist-packages (from torch==2.9.1->vllm) (3.6.1)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.12/dist-packages (from torch==2.9.1->vllm) (3.1.6)
Requirement already satisfied: fsspec>=0.8.5 in /usr/local/lib/python3.12/dist-packages (from torch==2.9.1->vllm) (2025.12.0)
Collecting nvidia-cuda-nvrtc-cu12==12.8.93 (from torch==2.9.1->vllm)
  Downloading nvidia_cuda_nvrtc_cu12-12.8.93-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-cuda-runtime-cu12==12.8.90 (from torch==2.9.1->vllm)
  Downloading nvidia_cuda_runtime_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-cuda-cupti-cu12==12.8.90 (from torch==2.9.1->vllm)
  Downloading nvidia_cuda_cupti_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB)
Requirement already satisfied: nvidia-cudnn-cu12==9.10.2.21 in /usr/local/lib/python3.12/dist-packages (from torch==2.9.1->vllm) (9.10.2.21)
Collecting nvidia-cublas-cu12==12.8.4.1 (from torch==2.9.1->vllm)
  Downloading nvidia_cublas_cu12-12.8.4.1-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-cufft-cu12==11.3.3.83 (from torch==2.9.1->vllm)
  Downloading nvidia_cufft_cu12-11.3.3.83-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-curand-cu12==10.3.9.90 (from torch==2.9.1->vllm)
  Downloading nvidia_curand_cu12-10.3.9.90-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-cusolver-cu12==11.7.3.90 (from torch==2.9.1->vllm)
  Downloading nvidia_cusolver_cu12-11.7.3.90-py3-none-manylinux_2_27_x86_64.whl.metadata (1.8 kB)
Collecting nvidia-cusparse-cu12==12.5.8.93 (from torch==2.9.1->vllm)
  Downloading nvidia_cusparse_cu12-12.5.8.93-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.8 kB)
Requirement already satisfied: nvidia-cusparselt-cu12==0.7.1 in /usr/local/lib/python3.12/dist-packages (from torch==2.9.1->vllm) (0.7.1)
Requirement already satisfied: nvidia-nccl-cu12==2.27.5 in /usr/local/lib/python3.12/dist-packages (from torch==2.9.1->vllm) (2.27.5)
Requirement already satisfied: nvidia-nvshmem-cu12==3.3.20 in /usr/local/lib/python3.12/dist-packages (from torch==2.9.1->vllm) (3.3.20)
Collecting nvidia-nvtx-cu12==12.8.90 (from torch==2.9.1->vllm)
  Downloading nvidia_nvtx_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.8 kB)
Collecting nvidia-nvjitlink-cu12==12.8.93 (from torch==2.9.1->vllm)
  Downloading nvidia_nvjitlink_cu12-12.8.93-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-cufile-cu12==1.13.1.3 (from torch==2.9.1->vllm)
  Downloading nvidia_cufile_cu12-1.13.1.3-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB)
Collecting triton==3.5.1 (from torch==2.9.1->vllm)
  Downloading triton-3.5.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.7 kB)
Requirement already satisfied: idna>=2.8 in /usr/local/lib/python3.12/dist-packages (from anyio<5,>=3.5.0->anthropic==0.71.0->vllm) (3.11)
Requirement already satisfied: certifi in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.25.0->anthropic==0.71.0->vllm) (2025.11.12)
Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.25.0->anthropic==0.71.0->vllm) (1.0.9)
Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.12/dist-packages (from httpcore==1.*->httpx<1,>=0.25.0->anthropic==0.71.0->vllm) (0.16.0)
Requirement already satisfied: jmespath in /usr/local/lib/python3.12/dist-packages (from model-hosting-container-standards<1.0.0,>=0.1.10->vllm) (1.0.1)
Requirement already satisfied: starlette>=0.49.1 in /usr/local/lib/python3.12/dist-packages (from model-hosting-container-standards<1.0.0,>=0.1.10->vllm) (0.50.0)
Requirement already satisfied: supervisor>=4.2.0 in /usr/local/lib/python3.12/dist-packages (from model-hosting-container-standards<1.0.0,>=0.1.10->vllm) (4.3.0)
Requirement already satisfied: annotated-types>=0.6.0 in /usr/local/lib/python3.12/dist-packages (from pydantic>=2.12.0->vllm) (0.7.0)
Requirement already satisfied: pydantic-core==2.41.5 in /usr/local/lib/python3.12/dist-packages (from pydantic>=2.12.0->vllm) (2.41.5)
Requirement already satisfied: typing-inspection>=0.4.2 in /usr/local/lib/python3.12/dist-packages (from pydantic>=2.12.0->vllm) (0.4.2)
Requirement already satisfied: huggingface-hub<1.0,>=0.34.0 in /usr/local/lib/python3.12/dist-packages (from transformers<5,>=4.56.0->vllm) (0.36.0)
Requirement already satisfied: safetensors>=0.4.3 in /usr/local/lib/python3.12/dist-packages (from transformers<5,>=4.56.0->vllm) (0.7.0)
Requirement already satisfied: hf-xet<2.0.0,>=1.1.3 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<1.0,>=0.34.0->transformers<5,>=4.56.0->vllm) (1.2.0)
Requirement already satisfied: annotated-doc>=0.0.2 in /usr/local/lib/python3.12/dist-packages (from fastapi>=0.115.0->fastapi[standard]>=0.115.0->vllm) (0.0.4)
Requirement already satisfied: fastapi-cli>=0.0.8 in /usr/local/lib/python3.12/dist-packages (from fastapi-cli[standard[]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (0.0.16)
Requirement already satisfied: python-multipart>=0.0.18 in /usr/local/lib/python3.12/dist-packages (from fastapi[standard]>=0.115.0->vllm) (0.0.21)
Requirement already satisfied: email-validator>=2.0.0 in /usr/local/lib/python3.12/dist-packages (from fastapi[standard]>=0.115.0->vllm) (2.3.0)
Requirement already satisfied: uvicorn>=0.12.0 in /usr/local/lib/python3.12/dist-packages (from uvicorn[standard[]>=0.12.0; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (0.38.0)
Requirement already satisfied: dnspython>=2.0.0 in /usr/local/lib/python3.12/dist-packages (from email-validator>=2.0.0->fastapi[standard]>=0.115.0->vllm) (2.8.0)
Requirement already satisfied: typer>=0.15.1 in /usr/local/lib/python3.12/dist-packages (from fastapi-cli>=0.0.8->fastapi-cli[standard[]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (0.20.0)
Requirement already satisfied: rich-toolkit>=0.14.8 in /usr/local/lib/python3.12/dist-packages (from fastapi-cli>=0.0.8->fastapi-cli[standard[]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (0.17.1)
Requirement already satisfied: fastapi-cloud-cli>=0.1.1 in /usr/local/lib/python3.12/dist-packages (from fastapi-cli[standard[]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (0.7.0)
Requirement already satisfied: rignore>=0.5.1 in /usr/local/lib/python3.12/dist-packages (from fastapi-cloud-cli>=0.1.1->fastapi-cli[standard[]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (0.7.6)
Requirement already satisfied: sentry-sdk>=2.20.0 in /usr/local/lib/python3.12/dist-packages (from fastapi-cloud-cli>=0.1.1->fastapi-cli[standard[]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (2.48.0)
Requirement already satisfied: fastar>=0.8.0 in /usr/local/lib/python3.12/dist-packages (from fastapi-cloud-cli>=0.1.1->fastapi-cli[standard[]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (0.8.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.12/dist-packages (from jinja2->torch==2.9.1->vllm) (3.0.3)
Requirement already satisfied: jsonschema>=4.21.1 in /usr/local/lib/python3.12/dist-packages (from mistral_common>=1.8.8->mistral_common[image]>=1.8.8->vllm) (4.25.1)
Requirement already satisfied: pydantic-extra-types>=2.10.5 in /usr/local/lib/python3.12/dist-packages (from pydantic-extra-types[pycountry[]>=2.10.5->mistral_common>=1.8.8->mistral_common[image]>=1.8.8->vllm) (2.10.6)
Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.12/dist-packages (from jsonschema>=4.21.1->mistral_common>=1.8.8->mistral_common[image]>=1.8.8->vllm) (25.4.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.12/dist-packages (from jsonschema>=4.21.1->mistral_common>=1.8.8->mistral_common[image]>=1.8.8->vllm) (2025.9.1)
Requirement already satisfied: referencing>=0.28.4 in /usr/local/lib/python3.12/dist-packages (from jsonschema>=4.21.1->mistral_common>=1.8.8->mistral_common[image]>=1.8.8->vllm) (0.37.0)
Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.12/dist-packages (from jsonschema>=4.21.1->mistral_common>=1.8.8->mistral_common[image]>=1.8.8->vllm) (0.30.0)
Requirement already satisfied: cuda-python>=12.8 in /usr/local/lib/python3.12/dist-packages (from nvidia-cutlass-dsl>=4.2.1->flashinfer-python==0.5.3->vllm) (13.1.1)
Requirement already satisfied: cuda-bindings~=13.1.1 in /usr/local/lib/python3.12/dist-packages (from cuda-python>=12.8->nvidia-cutlass-dsl>=4.2.1->flashinfer-python==0.5.3->vllm) (13.1.1)
Requirement already satisfied: cuda-pathfinder~=1.1 in /usr/local/lib/python3.12/dist-packages (from cuda-python>=12.8->nvidia-cutlass-dsl>=4.2.1->flashinfer-python==0.5.3->vllm) (1.3.3)
Requirement already satisfied: pycountry>=23 in /usr/local/lib/python3.12/dist-packages (from pydantic-extra-types[pycountry[]>=2.10.5->mistral_common>=1.8.8->mistral_common[image]>=1.8.8->vllm) (24.6.1)
Requirement already satisfied: msgpack<2.0.0,>=1.0.0 in /usr/local/lib/python3.12/dist-packages (from ray>=2.48.0->ray[cgraph]>=2.48.0->vllm) (1.1.2)
Requirement already satisfied: cupy-cuda12x in /usr/local/lib/python3.12/dist-packages (from ray[cgraph]>=2.48.0->vllm) (13.6.0)
Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests>=2.26.0->vllm) (3.4.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests>=2.26.0->vllm) (2.6.2)
Requirement already satisfied: rich>=13.7.1 in /usr/local/lib/python3.12/dist-packages (from rich-toolkit>=0.14.8->fastapi-cli>=0.0.8->fastapi-cli[standard[]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (14.2.0)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.12/dist-packages (from rich>=13.7.1->rich-toolkit>=0.14.8->fastapi-cli>=0.0.8->fastapi-cli[standard[]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (4.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.12/dist-packages (from rich>=13.7.1->rich-toolkit>=0.14.8->fastapi-cli>=0.0.8->fastapi-cli[standard[]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (2.19.2)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.12/dist-packages (from markdown-it-py>=2.2.0->rich>=13.7.1->rich-toolkit>=0.14.8->fastapi-cli>=0.0.8->fastapi-cli[standard[]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (0.1.2)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.12/dist-packages (from sympy>=1.13.3->torch==2.9.1->vllm) (1.3.0)
Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.12/dist-packages (from typer>=0.15.1->fastapi-cli>=0.0.8->fastapi-cli[standard[]>=0.0.8; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (1.5.4)
Requirement already satisfied: httptools>=0.6.3 in /usr/local/lib/python3.12/dist-packages (from uvicorn[standard[]>=0.12.0; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (0.7.1)
Requirement already satisfied: python-dotenv>=0.13 in /usr/local/lib/python3.12/dist-packages (from uvicorn[standard[]>=0.12.0; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (1.2.1)
Requirement already satisfied: uvloop>=0.15.1 in /usr/local/lib/python3.12/dist-packages (from uvicorn[standard[]>=0.12.0; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (0.22.1)
Requirement already satisfied: websockets>=10.4 in /usr/local/lib/python3.12/dist-packages (from uvicorn[standard[]>=0.12.0; extra == "standard"->fastapi[standard]>=0.115.0->vllm) (15.0.1)
Requirement already satisfied: aiohappyeyeballs>=2.5.0 in /usr/local/lib/python3.12/dist-packages (from aiohttp->vllm) (2.6.1)
Requirement already satisfied: aiosignal>=1.4.0 in /usr/local/lib/python3.12/dist-packages (from aiohttp->vllm) (1.4.0)
Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.12/dist-packages (from aiohttp->vllm) (1.8.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.12/dist-packages (from aiohttp->vllm) (6.7.0)
Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.12/dist-packages (from aiohttp->vllm) (0.4.1)
Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/lib/python3.12/dist-packages (from aiohttp->vllm) (1.22.0)
Requirement already satisfied: fastrlock>=0.5 in /usr/local/lib/python3.12/dist-packages (from cupy-cuda12x->ray[cgraph]>=2.48.0->vllm) (0.8.3)
Requirement already satisfied: httpx-sse>=0.4 in /usr/local/lib/python3.12/dist-packages (from mcp->vllm) (0.4.3)
Requirement already satisfied: pydantic-settings>=2.5.2 in /usr/local/lib/python3.12/dist-packages (from mcp->vllm) (2.12.0)
Requirement already satisfied: pyjwt>=2.10.1 in /usr/local/lib/python3.12/dist-packages (from pyjwt[crypto]>=2.10.1->mcp->vllm) (2.10.1)
Requirement already satisfied: sse-starlette>=1.6.1 in /usr/local/lib/python3.12/dist-packages (from mcp->vllm) (3.0.4)
Requirement already satisfied: cryptography>=3.4.0 in /usr/local/lib/python3.12/dist-packages (from pyjwt[crypto]>=2.10.1->mcp->vllm) (46.0.3)
Requirement already satisfied: cffi>=2.0.0 in /usr/local/lib/python3.12/dist-packages (from cryptography>=3.4.0->pyjwt[crypto]>=2.10.1->mcp->vllm) (2.0.0)
Requirement already satisfied: pycparser in /usr/local/lib/python3.12/dist-packages (from cffi>=2.0.0->cryptography>=3.4.0->pyjwt[crypto]>=2.10.1->mcp->vllm) (2.23)
Downloading vllm-0.14.0-cp38-abi3-manylinux_2_31_x86_64.whl (495.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 495.4/495.4 MB 19.4 MB/s  0:00:22
Downloading compressed_tensors-0.13.0-py3-none-any.whl (192 kB)
Downloading torch-2.9.1-cp312-cp312-manylinux_2_28_x86_64.whl (899.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 899.7/899.7 MB 19.0 MB/s  0:00:41
Downloading nvidia_cublas_cu12-12.8.4.1-py3-none-manylinux_2_27_x86_64.whl (594.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 594.3/594.3 MB 20.4 MB/s  0:00:27
Downloading nvidia_cuda_cupti_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (10.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.2/10.2 MB 21.4 MB/s  0:00:00
Downloading nvidia_cuda_nvrtc_cu12-12.8.93-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (88.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.0/88.0 MB 24.0 MB/s  0:00:03
Downloading nvidia_cuda_runtime_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (954 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 954.8/954.8 kB 21.2 MB/s  0:00:00
Downloading nvidia_cufft_cu12-11.3.3.83-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (193.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.1/193.1 MB 21.6 MB/s  0:00:08
Downloading nvidia_cufile_cu12-1.13.1.3-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 16.7 MB/s  0:00:00
Downloading nvidia_curand_cu12-10.3.9.90-py3-none-manylinux_2_27_x86_64.whl (63.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.6/63.6 MB 20.7 MB/s  0:00:03
Downloading nvidia_cusolver_cu12-11.7.3.90-py3-none-manylinux_2_27_x86_64.whl (267.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 267.5/267.5 MB 21.6 MB/s  0:00:12
Downloading nvidia_cusparse_cu12-12.5.8.93-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (288.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 288.2/288.2 MB 21.6 MB/s  0:00:13
Downloading nvidia_nvjitlink_cu12-12.8.93-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.3/39.3 MB 21.4 MB/s  0:00:01
Downloading nvidia_nvtx_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB)
Downloading torchaudio-2.9.1-cp312-cp312-manylinux_2_28_x86_64.whl (2.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 20.5 MB/s  0:00:00
Downloading torchvision-0.24.1-cp312-cp312-manylinux_2_28_x86_64.whl (8.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.0/8.0 MB 22.0 MB/s  0:00:00
Downloading triton-3.5.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (170.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 170.5/170.5 MB 21.6 MB/s  0:00:07
Downloading xgrammar-0.1.29-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.9/34.9 MB 19.5 MB/s  0:00:01
Downloading grpcio-1.78.0rc2-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (6.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.7/6.7 MB 19.8 MB/s  0:00:00
Downloading grpcio_reflection-1.78.0rc2-py3-none-any.whl (22 kB)
Downloading mistral_common-1.8.8-py3-none-any.whl (6.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.5/6.5 MB 20.0 MB/s  0:00:00
Installing collected packages: triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, grpcio, nvidia-cusparse-cu12, nvidia-cufft-cu12, grpcio-reflection, nvidia-cusolver-cu12, torch, xgrammar, torchvision, torchaudio, mistral_common, compressed-tensors, vllm
  Attempting uninstall: triton
    Found existing installation: triton 3.5.0
    Uninstalling triton-3.5.0:
      Successfully uninstalled triton-3.5.0
  Attempting uninstall: nvidia-nvtx-cu12
    Found existing installation: nvidia-nvtx-cu12 12.9.79
    Uninstalling nvidia-nvtx-cu12-12.9.79:
      Successfully uninstalled nvidia-nvtx-cu12-12.9.79
  Attempting uninstall: nvidia-nvjitlink-cu12
    Found existing installation: nvidia-nvjitlink-cu12 12.9.86
    Uninstalling nvidia-nvjitlink-cu12-12.9.86:
      Successfully uninstalled nvidia-nvjitlink-cu12-12.9.86
  Attempting uninstall: nvidia-curand-cu12
    Found existing installation: nvidia-curand-cu12 10.3.10.19
    Uninstalling nvidia-curand-cu12-10.3.10.19:
      Successfully uninstalled nvidia-curand-cu12-10.3.10.19
  Attempting uninstall: nvidia-cufile-cu12
    Found existing installation: nvidia-cufile-cu12 1.14.1.1
    Uninstalling nvidia-cufile-cu12-1.14.1.1:
      Successfully uninstalled nvidia-cufile-cu12-1.14.1.1
  Attempting uninstall: nvidia-cuda-runtime-cu12
    Found existing installation: nvidia-cuda-runtime-cu12 12.9.79
    Uninstalling nvidia-cuda-runtime-cu12-12.9.79:
      Successfully uninstalled nvidia-cuda-runtime-cu12-12.9.79
  Attempting uninstall: nvidia-cuda-nvrtc-cu12
    Found existing installation: nvidia-cuda-nvrtc-cu12 12.9.86
    Uninstalling nvidia-cuda-nvrtc-cu12-12.9.86:
      Successfully uninstalled nvidia-cuda-nvrtc-cu12-12.9.86
  Attempting uninstall: nvidia-cuda-cupti-cu12
    Found existing installation: nvidia-cuda-cupti-cu12 12.9.79
    Uninstalling nvidia-cuda-cupti-cu12-12.9.79:
      Successfully uninstalled nvidia-cuda-cupti-cu12-12.9.79
  Attempting uninstall: nvidia-cublas-cu12
    Found existing installation: nvidia-cublas-cu12 12.9.1.4
    Uninstalling nvidia-cublas-cu12-12.9.1.4:
      Successfully uninstalled nvidia-cublas-cu12-12.9.1.4
  Attempting uninstall: nvidia-cusparse-cu12
    Found existing installation: nvidia-cusparse-cu12 12.5.10.65
    Uninstalling nvidia-cusparse-cu12-12.5.10.65:
      Successfully uninstalled nvidia-cusparse-cu12-12.5.10.65
  Attempting uninstall: nvidia-cufft-cu12
    Found existing installation: nvidia-cufft-cu12 11.4.1.4
    Uninstalling nvidia-cufft-cu12-11.4.1.4:
      Successfully uninstalled nvidia-cufft-cu12-11.4.1.4
  Attempting uninstall: nvidia-cusolver-cu12
    Found existing installation: nvidia-cusolver-cu12 11.7.5.82
    Uninstalling nvidia-cusolver-cu12-11.7.5.82:
      Successfully uninstalled nvidia-cusolver-cu12-11.7.5.82
  Attempting uninstall: torch
    Found existing installation: torch 2.9.0+cu129
    Uninstalling torch-2.9.0+cu129:
      Successfully uninstalled torch-2.9.0+cu129
  Attempting uninstall: xgrammar
    Found existing installation: xgrammar 0.1.27
    Uninstalling xgrammar-0.1.27:
      Successfully uninstalled xgrammar-0.1.27
  Attempting uninstall: torchvision
    Found existing installation: torchvision 0.24.0+cu129
    Uninstalling torchvision-0.24.0+cu129:
      Successfully uninstalled torchvision-0.24.0+cu129
  Attempting uninstall: torchaudio
    Found existing installation: torchaudio 2.9.0+cu129
    Uninstalling torchaudio-2.9.0+cu129:
      Successfully uninstalled torchaudio-2.9.0+cu129
  Attempting uninstall: mistral_common
    Found existing installation: mistral_common 1.8.6
    Uninstalling mistral_common-1.8.6:
      Successfully uninstalled mistral_common-1.8.6
  Attempting uninstall: compressed-tensors
    Found existing installation: compressed-tensors 0.12.2
    Uninstalling compressed-tensors-0.12.2:
      Successfully uninstalled compressed-tensors-0.12.2
  Attempting uninstall: vllm
    Found existing installation: vllm 0.13.0
    Uninstalling vllm-0.13.0:
      Successfully uninstalled vllm-0.13.0

Successfully installed compressed-tensors-0.13.0 grpcio-1.78.0rc2 grpcio-reflection-1.78.0rc2 mistral_common-1.8.8 nvidia-cublas-cu12-12.8.4.1 nvidia-cuda-cupti-cu12-12.8.90 nvidia-cuda-nvrtc-cu12-12.8.93 nvidia-cuda-runtime-cu12-12.8.90 nvidia-cufft-cu12-11.3.3.83 nvidia-cufile-cu12-1.13.1.3 nvidia-curand-cu12-10.3.9.90 nvidia-cusolver-cu12-11.7.3.90 nvidia-cusparse-cu12-12.5.8.93 nvidia-nvjitlink-cu12-12.8.93 nvidia-nvtx-cu12-12.8.90 torch-2.9.1 torchaudio-2.9.1 torchvision-0.24.1 triton-3.5.1 vllm-0.14.0 xgrammar-0.1.29
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-_zyvj5_e
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-_zyvj5_e
  Resolved https://github.com/huggingface/transformers.git to commit 9055ee4dd9ae6e258b8244faccfbdfa5c8e313e4
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Requirement already satisfied: filelock in /usr/local/lib/python3.12/dist-packages (from transformers==5.0.0.dev0) (3.20.1)
Collecting huggingface-hub<2.0,>=1.3.0 (from transformers==5.0.0.dev0)
  Downloading huggingface_hub-1.3.2-py3-none-any.whl.metadata (13 kB)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.12/dist-packages (from transformers==5.0.0.dev0) (2.2.6)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.12/dist-packages (from transformers==5.0.0.dev0) (25.0)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.12/dist-packages (from transformers==5.0.0.dev0) (6.0.3)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.12/dist-packages (from transformers==5.0.0.dev0) (2025.11.3)
Requirement already satisfied: requests in /usr/local/lib/python3.12/dist-packages (from transformers==5.0.0.dev0) (2.32.5)
Requirement already satisfied: tokenizers<=0.23.0,>=0.22.0 in /usr/local/lib/python3.12/dist-packages (from transformers==5.0.0.dev0) (0.22.1)
Collecting typer-slim (from transformers==5.0.0.dev0)
  Downloading typer_slim-0.21.1-py3-none-any.whl.metadata (16 kB)
Requirement already satisfied: safetensors>=0.4.3 in /usr/local/lib/python3.12/dist-packages (from transformers==5.0.0.dev0) (0.7.0)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.12/dist-packages (from transformers==5.0.0.dev0) (4.67.1)
Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.3.0->transformers==5.0.0.dev0) (2025.12.0)
Requirement already satisfied: hf-xet<2.0.0,>=1.2.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.3.0->transformers==5.0.0.dev0) (1.2.0)
Requirement already satisfied: httpx<1,>=0.23.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.3.0->transformers==5.0.0.dev0) (0.28.1)
Requirement already satisfied: shellingham in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.3.0->transformers==5.0.0.dev0) (1.5.4)
Requirement already satisfied: typing-extensions>=4.1.0 in /usr/local/lib/python3.12/dist-packages (from huggingface-hub<2.0,>=1.3.0->transformers==5.0.0.dev0) (4.15.0)
Requirement already satisfied: anyio in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.3.0->transformers==5.0.0.dev0) (4.12.0)
Requirement already satisfied: certifi in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.3.0->transformers==5.0.0.dev0) (2025.11.12)
Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.3.0->transformers==5.0.0.dev0) (1.0.9)
Requirement already satisfied: idna in /usr/local/lib/python3.12/dist-packages (from httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.3.0->transformers==5.0.0.dev0) (3.11)
Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.12/dist-packages (from httpcore==1.*->httpx<1,>=0.23.0->huggingface-hub<2.0,>=1.3.0->transformers==5.0.0.dev0) (0.16.0)
Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests->transformers==5.0.0.dev0) (3.4.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests->transformers==5.0.0.dev0) (2.6.2)
Requirement already satisfied: click>=8.0.0 in /usr/local/lib/python3.12/dist-packages (from typer-slim->transformers==5.0.0.dev0) (8.2.1)
Downloading huggingface_hub-1.3.2-py3-none-any.whl (534 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 534.5/534.5 kB 11.5 MB/s  0:00:00
Downloading typer_slim-0.21.1-py3-none-any.whl (47 kB)
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml): started
  Building wheel for transformers (pyproject.toml): finished with status 'done'
  Created wheel for transformers: filename=transformers-5.0.0.dev0-py3-none-any.whl size=11155169 sha256=2caeb9112cca1e843c402fef3e508f25d372e63595b5256a0234211bd998fcc1
  Stored in directory: /tmp/pip-ephem-wheel-cache-f415onb8/wheels/54/cb/3f/83103de5575c534436d6a4686686dead458238dfaf1147e98d
Successfully built transformers
Installing collected packages: typer-slim, huggingface-hub, transformers
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.36.0
    Uninstalling huggingface-hub-0.36.0:
      Successfully uninstalled huggingface-hub-0.36.0
  Attempting uninstall: transformers
    Found existing installation: transformers 4.57.3
    Uninstalling transformers-4.57.3:
      Successfully uninstalled transformers-4.57.3

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
vllm 0.14.0 requires transformers<5,>=4.56.0, but you have transformers 5.0.0.dev0 which is incompatible.
Successfully installed huggingface-hub-1.3.2 transformers-5.0.0.dev0 typer-slim-0.21.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
WARNING 01-20 05:16:45 [argparse_utils.py:342] Found duplicate keys --tensor-parallel-size
(APIServer pid=611) INFO 01-20 05:16:45 [api_server.py:1272] vLLM API server version 0.14.0
(APIServer pid=611) INFO 01-20 05:16:45 [utils.py:263] non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'glm47', 'model': 'marksverdhei/GLM-4.7-Flash-FP8', 'trust_remote_code': True, 'max_model_len': 32768, 'quantization': 'fp8', 'served_model_name': ['glm47flash'], 'reasoning_parser': 'glm45', 'block_size': 16, 'gpu_memory_utilization': 0.92, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'max_num_batched_tokens': 8192, 'max_num_seqs': 64, 'enable_chunked_prefill': True, 'async_scheduling': True}
(APIServer pid=611) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=611) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=611) INFO 01-20 05:16:52 [model.py:530] Resolved architecture: TransformersMoEForCausalLM
(APIServer pid=611) INFO 01-20 05:16:52 [model.py:1545] Using max model len 32768
(APIServer pid=611) INFO 01-20 05:16:54 [cache.py:206] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=611) INFO 01-20 05:16:54 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=611) INFO 01-20 05:16:54 [vllm.py:630] Asynchronous scheduling is enabled.
(APIServer pid=611) INFO 01-20 05:16:54 [vllm.py:637] Disabling NCCL for DP synchronization when using async scheduling.
(APIServer pid=611) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(EngineCore_DP0 pid=814) INFO 01-20 05:17:03 [core.py:97[] Initializing a V1 LLM engine (v0.14.0) with config: model='marksverdhei/GLM-4.7-Flash-FP8', speculative_config=None, tokenizer='marksverdhei/GLM-4.7-Flash-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='glm45', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=glm47flash, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192[], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 128, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=814) INFO 01-20 05:17:03 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.42.0.167:45911 backend=nccl
(EngineCore_DP0 pid=814) INFO 01-20 05:17:03 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=814) WARNING 01-20 05:17:04 [utils.py:184] TransformersMoEForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
(EngineCore_DP0 pid=814) INFO 01-20 05:17:04 [gpu_model_runner.py:3808] Starting to load model marksverdhei/GLM-4.7-Flash-FP8...
(EngineCore_DP0 pid=814) INFO 01-20 05:17:04 [base.py:134] Using Transformers modeling backend.
(EngineCore_DP0 pid=814) INFO 01-20 05:17:04 [fp8.py:126] DeepGEMM is disabled because the platform does not support it.
(EngineCore_DP0 pid=814) INFO 01-20 05:17:04 [fp8.py:149] Using Triton backend for FP8 MoE
(EngineCore_DP0 pid=814) INFO 01-20 05:17:20 [cuda.py:351] Using FLASHINFER attention backend out of potential backends: ('FLASHINFER', 'TRITON_ATTN')
(EngineCore_DP0 pid=814) INFO 01-20 05:17:21 [weight_utils.py:550] No model.safetensors.index.json found in remote.
(EngineCore_DP0 pid=814) 
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936] EngineCore failed to start.
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936] Traceback (most recent call last):
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]     super().__init__(
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=814) Process EngineCore_DP0:
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]     self._init_executor()
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]     self.driver_worker.load_model()
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 274, in load_model
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3827, in load_model
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]     self.model = model_loader.load_model(
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 58, in load_model
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]     self.load_weights(model, model_config)
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 288, in load_weights
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]     loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/transformers/base.py", line 492, in load_weights
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]     return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]     return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]     yield from self._load_module(
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]     yield from self._load_module(
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]     yield from self._load_module(
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936[]   [Previous line repeated 2 more times]
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 319, in _load_module
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936]     raise ValueError(msg)
(EngineCore_DP0 pid=814) ERROR 01-20 05:17:22 [core.py:936] ValueError: There is no module or parameter named 'model.layers.1.mlp.gate.e_score_correction_bias' in TransformersMoEForCausalLM
(EngineCore_DP0 pid=814) Traceback (most recent call last):
(EngineCore_DP0 pid=814)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=814)     self.run()
(EngineCore_DP0 pid=814)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=814)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=814)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 940, in run_engine_core
(EngineCore_DP0 pid=814)     raise e
(EngineCore_DP0 pid=814)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=814)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=814)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=814)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=814)     super().__init__(
(EngineCore_DP0 pid=814)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=814)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=814)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=814)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=814)     self._init_executor()
(EngineCore_DP0 pid=814)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=814)     self.driver_worker.load_model()
(EngineCore_DP0 pid=814)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 274, in load_model
(EngineCore_DP0 pid=814)     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=814)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3827, in load_model
(EngineCore_DP0 pid=814)     self.model = model_loader.load_model(
(EngineCore_DP0 pid=814)                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=814)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 58, in load_model
(EngineCore_DP0 pid=814)     self.load_weights(model, model_config)
(EngineCore_DP0 pid=814)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 288, in load_weights
(EngineCore_DP0 pid=814)     loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore_DP0 pid=814)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=814)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/transformers/base.py", line 492, in load_weights
(EngineCore_DP0 pid=814)     return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=814)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=814)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=814)     return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=814)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=814)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=814)     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=814)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=814)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=814)     yield from self._load_module(
(EngineCore_DP0 pid=814)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=814)     yield from self._load_module(
(EngineCore_DP0 pid=814)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=814)     yield from self._load_module(
(EngineCore_DP0 pid=814)   [Previous line repeated 2 more times]
(EngineCore_DP0 pid=814)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 319, in _load_module
(EngineCore_DP0 pid=814)     raise ValueError(msg)
(EngineCore_DP0 pid=814) ValueError: There is no module or parameter named 'model.layers.1.mlp.gate.e_score_correction_bias' in TransformersMoEForCausalLM
(EngineCore_DP0 pid=814) 
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_DP0 pid=814) 
[rank0[]:[W120 05:17:22.920484914 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=611) Traceback (most recent call last):
(APIServer pid=611)   File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=611)   File "<frozen runpy>", line 88, in _run_code
(APIServer pid=611)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1390, in <module>
(APIServer pid=611)     uvloop.run(run_server(args))
(APIServer pid=611)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=611)     return __asyncio.run(
(APIServer pid=611)            ^^^^^^^^^^^^^^
(APIServer pid=611)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=611)     return runner.run(main)
(APIServer pid=611)            ^^^^^^^^^^^^^^^^
(APIServer pid=611)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=611)     return self._loop.run_until_complete(task)
(APIServer pid=611)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=611)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=611)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=611)     return await main
(APIServer pid=611)            ^^^^^^^^^^
(APIServer pid=611)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1319, in run_server
(APIServer pid=611)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=611)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1338, in run_server_worker
(APIServer pid=611)     async with build_async_engine_client(
(APIServer pid=611)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=611)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=611)     return await anext(self.gen)
(APIServer pid=611)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=611)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 173, in build_async_engine_client
(APIServer pid=611)     async with build_async_engine_client_from_engine_args(
(APIServer pid=611)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=611)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=611)     return await anext(self.gen)
(APIServer pid=611)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=611)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 214, in build_async_engine_client_from_engine_args
(APIServer pid=611)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=611)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=611)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 205, in from_vllm_config
(APIServer pid=611)     return cls(
(APIServer pid=611)            ^^^^
(APIServer pid=611)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 132, in __init__
(APIServer pid=611)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=611)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=611)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 122, in make_async_mp_client
(APIServer pid=611)     return AsyncMPClient(*client_args)
(APIServer pid=611)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=611)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 824, in __init__
(APIServer pid=611)     super().__init__(
(APIServer pid=611)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 479, in __init__
(APIServer pid=611)     with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=611)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=611)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=611)     next(self.gen)
(APIServer pid=611)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 921, in launch_core_engines
(APIServer pid=611)     wait_for_engine_startup(
(APIServer pid=611)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 980, in wait_for_engine_startup
(APIServer pid=611)     raise RuntimeError(
(APIServer pid=611) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
stream closed: EOF for ai-services/ai-platform-glm47-flash-74f4f98df6-5vzmx (vllm)

CHNtentes

Jan 20

(EngineCore_DP0 pid=814) WARNING 01-20 05:17:04 [utils.py:184] TransformersMoEForCausalLM has no vLLM implementation, falling back to Transformers implementation. Some features may not be supported and performance may not be optimal.
(EngineCore_DP0 pid=814) INFO 01-20 05:17:04 [gpu_model_runner.py:3808] Starting to load model marksverdhei/GLM-4.7-Flash-FP8...
(EngineCore_DP0 pid=814) INFO 01-20 05:17:04 [base.py:134] Using Transformers modeling backend.

You should not use vllm with falling back to transformers implementation. It's possible that your vllm / transformers is not the latest version and therefore does not recognise the model's architecture.

tinging

Jan 20

make a new container based in nightly version by Dockerfile where you update your transformer

seltzer-cat

Jan 21

I'm also encountering this issue. I switched to vllm/vllm-openai:nightly and then added a step to install transformers into the container but there are multiple other dependency mis-matches that result from this.

victors2709

Jan 21

It works with https://hub.docker.com/layers/lmsysorg/sglang/dev-pr-17247/images/sha256-e2bc891ffdaa1dce421667c0d531fba31365dcdf088fcb9839da8809b9fd45e7

Yu21342

Jan 21

It works with https://hub.docker.com/layers/lmsysorg/sglang/dev-pr-17247/images/sha256-e2bc891ffdaa1dce421667c0d531fba31365dcdf088fcb9839da8809b9fd45e7

yes It works thank you！

divinefeng

Jan 22

hello, how it works? thank you!

dwojcik

Jan 22

I was able to run the model. Take a look at the discussion on GitHub where I described necessary steps: https://github.com/vllm-project/vllm/issues/32637#issuecomment-3782923143.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment