vLLM fails to load AntAngelMed GGUF: "GGUF model with architecture bailingmoe2 is not supported yet."

#8
by AlanWus - opened

Hi maintainers,

I’m trying to serve AntAngelMed from a GGUF checkpoint using vLLM, but the server fails during initialization with:

ValueError: GGUF model with architecture bailingmoe2 is not supported yet.

Model file

  • File: /content/drive/MyDrive/AntAngelMed.i1-IQ3_XXS.gguf
  • Quant: IQ3_XXS (GGUF)

What I’m trying to do

Serve the GGUF model with vLLM's OpenAI-compatible server.

Command:

nohup vllm serve /content/drive/MyDrive/AntAngelMed.i1-IQ3_XXS.gguf \
  --dtype auto \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --served-model-name AntAngelMed \
  --gpu-memory-utilization 0.95 \
  --port 8000 \
  > serve.log 2>&1 &
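To double-check what architecture string the checkpoint actually carries (independent of vLLM/transformers), here is a minimal stdlib sketch that reads `general.architecture` out of the GGUF header. It assumes the standard little-endian GGUF v2/v3 layout and only parses enough of the metadata to find that key; `gguf_architecture` is just a helper name for this post, not an official API:

```python
# Minimal GGUF header reader (stdlib only) to confirm the architecture
# string recorded in the file. Debugging sketch, not a full GGUF parser.
import struct

# GGUF metadata value-type ids -> struct formats for fixed-size scalars.
_SCALAR = {0: "B", 1: "b", 2: "H", 3: "h", 4: "I", 5: "i",
           6: "f", 7: "?", 10: "Q", 11: "q", 12: "d"}
_STRING, _ARRAY = 8, 9

def _read(f, fmt):
    size = struct.calcsize("<" + fmt)
    return struct.unpack("<" + fmt, f.read(size))[0]

def _read_string(f):
    n = _read(f, "Q")                 # uint64 length prefix
    return f.read(n).decode("utf-8")

def _skip_value(f, vtype):
    if vtype in _SCALAR:
        f.read(struct.calcsize("<" + _SCALAR[vtype]))
    elif vtype == _STRING:
        _read_string(f)
    elif vtype == _ARRAY:
        etype = _read(f, "I")         # element type
        count = _read(f, "Q")         # element count
        for _ in range(count):
            _skip_value(f, etype)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")

def gguf_architecture(path):
    """Return the `general.architecture` string stored in a GGUF file."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        _read(f, "I")                 # version
        _read(f, "Q")                 # tensor count
        kv_count = _read(f, "Q")      # metadata key/value count
        for _ in range(kv_count):
            key = _read_string(f)
            vtype = _read(f, "I")
            if key == "general.architecture" and vtype == _STRING:
                return _read_string(f)
            _skip_value(f, vtype)
    raise ValueError("general.architecture not found")

# Usage (path from this report):
# gguf_architecture("/content/drive/MyDrive/AntAngelMed.i1-IQ3_XXS.gguf")
# should return "bailingmoe2", matching the traceback.
```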

Expected behavior

vLLM server starts successfully and exposes the OpenAI-compatible API endpoint.

Actual behavior

vLLM crashes at startup. Full traceback ends with:

ValueError: GGUF model with architecture bailingmoe2 is not supported yet.

Relevant log snippet:

(APIServer pid=4264) INFO 01-09 08:00:16 [api_server.py:1351] vLLM API server version 0.13.0
(APIServer pid=4264) INFO 01-09 08:00:16 [utils.py:253] non-default args: {'model_tag': '/content/drive/MyDrive/AntAngelMed.i1-IQ3_XXS.gguf', 'model': '/content/drive/MyDrive/AntAngelMed.i1-IQ3_XXS.gguf', 'trust_remote_code': True, 'served_model_name': ['AntAngelMed'], 'gpu_memory_utilization': 0.95}
(APIServer pid=4264) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
...
(APIServer pid=4264)   File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_gguf_pytorch_utils.py", line 431, in load_gguf_checkpoint
(APIServer pid=4264)     raise ValueError(f"GGUF model with architecture {architecture} is not supported yet.")
(APIServer pid=4264) ValueError: GGUF model with architecture bailingmoe2 is not supported yet.

Environment

  • OS/Runtime: (Colab) Linux

  • Python: 3.12.12

  • vLLM: 0.13.0

  • transformers: 4.57.3

  • torch: 2.9.0+cu126

  • GPU: A100 80GB × 1

  • Additional packages seen in env:

    • gguf: 0.17.1

What I’ve tried

  • Changing vLLM args (--dtype auto, --trust-remote-code, etc.) does not help; the log even notes that trust_remote_code has no effect here and is ignored.
  • The crash happens before model serving (during config loading / GGUF parsing).

Questions / Request for guidance

  1. Is the GGUF build of AntAngelMed expected to be compatible with vLLM today?
  2. If not, what is the recommended inference backend for this GGUF (e.g., llama.cpp/llama-server, or a specific fork / required version)?
  3. Is there an alternative release format (HF weights / safetensors) that works with vLLM?
  4. If vLLM support is planned, is there a known patch / PR / compatibility roadmap for bailingmoe2 GGUF architecture?

Thanks! I can provide more logs or run additional debug commands if needed.
