vLLM fails to load AntAngelMed GGUF: "GGUF model with architecture bailingmoe2 is not supported yet."

#8
by AlanWus - opened

Hi maintainers,

I’m trying to serve AntAngelMed from a GGUF checkpoint using vLLM, but the server fails during initialization with:

ValueError: GGUF model with architecture bailingmoe2 is not supported yet.

Model file

  • File: /content/drive/MyDrive/AntAngelMed.i1-IQ3_XXS.gguf
  • Quant: IQ3_XXS (GGUF)

What I’m trying to do

Serve the GGUF model with vLLM's OpenAI-compatible server.

Command:

nohup vllm serve /content/drive/MyDrive/AntAngelMed.i1-IQ3_XXS.gguf \
  --dtype auto \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --served-model-name AntAngelMed \
  --gpu-memory-utilization 0.95 \
  --port 8000 \
  > serve.log 2>&1 &
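To double-check what architecture string the checkpoint actually carries (independent of vLLM/transformers), here is a minimal stdlib sketch that reads `general.architecture` out of the GGUF header. It assumes the standard little-endian GGUF v2/v3 layout and only parses enough of the metadata to find that key; `gguf_architecture` is just a helper name for this post, not an official API:

```python
# Minimal GGUF header reader (stdlib only) to confirm the architecture
# string recorded in the file. Debugging sketch, not a full GGUF parser.
import struct

# GGUF metadata value-type ids -> struct formats for fixed-size scalars.
_SCALAR = {0: "B", 1: "b", 2: "H", 3: "h", 4: "I", 5: "i",
           6: "f", 7: "?", 10: "Q", 11: "q", 12: "d"}
_STRING, _ARRAY = 8, 9

def _read(f, fmt):
    size = struct.calcsize("<" + fmt)
    return struct.unpack("<" + fmt, f.read(size))[0]

def _read_string(f):
    n = _read(f, "Q")                 # uint64 length prefix
    return f.read(n).decode("utf-8")

def _skip_value(f, vtype):
    if vtype in _SCALAR:
        f.read(struct.calcsize("<" + _SCALAR[vtype]))
    elif vtype == _STRING:
        _read_string(f)
    elif vtype == _ARRAY:
        etype = _read(f, "I")         # element type
        count = _read(f, "Q")         # element count
        for _ in range(count):
            _skip_value(f, etype)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")

def gguf_architecture(path):
    """Return the `general.architecture` string stored in a GGUF file."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        _read(f, "I")                 # version
        _read(f, "Q")                 # tensor count
        kv_count = _read(f, "Q")      # metadata key/value count
        for _ in range(kv_count):
            key = _read_string(f)
            vtype = _read(f, "I")
            if key == "general.architecture" and vtype == _STRING:
                return _read_string(f)
            _skip_value(f, vtype)
    raise ValueError("general.architecture not found")

# Usage (path from this report):
# gguf_architecture("/content/drive/MyDrive/AntAngelMed.i1-IQ3_XXS.gguf")
# should return "bailingmoe2", matching the traceback.
```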

Expected behavior

vLLM server starts successfully and exposes the OpenAI-compatible API endpoint.

Actual behavior

vLLM crashes at startup. Full traceback ends with:

ValueError: GGUF model with architecture bailingmoe2 is not supported yet.

Relevant log snippet:

(APIServer pid=4264) INFO 01-09 08:00:16 [api_server.py:1351] vLLM API server version 0.13.0
(APIServer pid=4264) INFO 01-09 08:00:16 [utils.py:253] non-default args: {'model_tag': '/content/drive/MyDrive/AntAngelMed.i1-IQ3_XXS.gguf', 'model': '/content/drive/MyDrive/AntAngelMed.i1-IQ3_XXS.gguf', 'trust_remote_code': True, 'served_model_name': ['AntAngelMed'], 'gpu_memory_utilization': 0.95}
(APIServer pid=4264) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
...
(APIServer pid=4264)   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_gguf_pytorch_utils.py", line 431, in load_gguf_checkpoint
(APIServer pid=4264)     raise ValueError(f"GGUF model with architecture {architecture} is not supported yet.")
(APIServer pid=4264) ValueError: GGUF model with architecture bailingmoe2 is not supported yet.

Environment

  • OS/Runtime: (Colab) Linux

  • Python: 3.12.12

  • vLLM: 0.13.0

  • transformers: 4.57.3

  • torch: 2.9.0+cu126

  • GPU: A100 80GB × 1

  • Additional packages seen in env:

    • gguf: 0.17.1

What I’ve tried

  • Changing vLLM args (--dtype auto, --trust-remote-code, etc.) does not help; the log even notes that trust_remote_code has no effect here and is ignored.
  • The crash happens before model serving (during config loading / GGUF parsing).

Questions / Request for guidance

  1. Is the GGUF build of AntAngelMed expected to be compatible with vLLM today?
  2. If not, what is the recommended inference backend for this GGUF (e.g., llama.cpp/llama-server, or a specific fork / required version)?
  3. Is there an alternative release format (HF weights / safetensors) that works with vLLM?
  4. If vLLM support is planned, is there a known patch / PR / compatibility roadmap for bailingmoe2 GGUF architecture?

Thanks! I can provide more logs or run additional debug commands if needed.
