vLLM fails to load AntAngelMed GGUF: "GGUF model with architecture bailingmoe2 is not supported yet."
#8 opened by AlanWus
Hi maintainers,

I’m trying to serve AntAngelMed from a GGUF checkpoint using vLLM, but the server fails during initialization with:

```
ValueError: GGUF model with architecture bailingmoe2 is not supported yet.
```
Repo
- Repo: inclusionAI/AntAngelMed-i1-GGUF
- https://huggingface.co/mradermacher/AntAngelMed-i1-GGUF
Model file
- File: /content/drive/MyDrive/AntAngelMed.i1-IQ3_XXS.gguf
- Quant: IQ3_XXS (GGUF)
What I’m trying to do
Serve the GGUF model with the vLLM OpenAI-compatible server.

Command:
```bash
nohup vllm serve /content/drive/MyDrive/AntAngelMed.i1-IQ3_XXS.gguf \
  --dtype auto \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --served-model-name AntAngelMed \
  --gpu-memory-utilization 0.95 \
  --port 8000 \
  > serve.log 2>&1 &
```
Expected behavior
vLLM server starts successfully and exposes the OpenAI-compatible API endpoint.
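For completeness, this is the kind of OpenAI-compatible request I expect the server to accept once it starts (a stdlib-only sketch; the model name and port match the serve command above):

```python
import json
import urllib.request

def build_chat_request(prompt, model="AntAngelMed",
                       base_url="http://localhost:8000/v1"):
    """Build an OpenAI-compatible chat completion request for the vLLM
    server (per --served-model-name and --port in the command above)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Once the server is up:
# with urllib.request.urlopen(build_chat_request("hello")) as r:
#     print(json.load(r))
```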
Actual behavior
vLLM crashes at startup. The full traceback ends with:

```
ValueError: GGUF model with architecture bailingmoe2 is not supported yet.
```
Relevant log snippet:
```
(APIServer pid=4264) INFO 01-09 08:00:16 [api_server.py:1351] vLLM API server version 0.13.0
(APIServer pid=4264) INFO 01-09 08:00:16 [utils.py:253] non-default args: {'model_tag': '/content/drive/MyDrive/AntAngelMed.i1-IQ3_XXS.gguf', 'model': '/content/drive/MyDrive/AntAngelMed.i1-IQ3_XXS.gguf', 'trust_remote_code': True, 'served_model_name': ['AntAngelMed'], 'gpu_memory_utilization': 0.95}
(APIServer pid=4264) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
...
(APIServer pid=4264) File "/usr/local/lib/python3.12/dist-packages/transformers/modeling_gguf_pytorch_utils.py", line 431, in load_gguf_checkpoint
(APIServer pid=4264)   raise ValueError(f"GGUF model with architecture {architecture} is not supported yet.")
(APIServer pid=4264) ValueError: GGUF model with architecture bailingmoe2 is not supported yet.
```
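To confirm the architecture string that transformers/vLLM sees without loading the whole model, the `general.architecture` key can be read straight from the GGUF header with the stdlib (a sketch assuming the v3 header layout; it only handles the common scalar/string value types, which is enough since `general.architecture` is conventionally the first key):

```python
import struct

def read_gguf_architecture(path):
    """Read `general.architecture` from a GGUF file header.
    Sketch only: stops at the first array-typed value it cannot skip."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        version, = struct.unpack("<I", f.read(4))
        n_tensors, n_kv = struct.unpack("<QQ", f.read(16))

        def read_str():
            n, = struct.unpack("<Q", f.read(8))
            return f.read(n).decode("utf-8")

        # byte sizes of fixed-width GGUF value types, indexed by type id
        sizes = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4,
                 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
        for _ in range(n_kv):
            key = read_str()
            vtype, = struct.unpack("<I", f.read(4))
            if vtype == 8:            # string
                val = read_str()
            elif vtype in sizes:      # fixed-width scalar: skip
                f.read(sizes[vtype])
                val = None
            else:                     # array etc. -- not needed here
                break
            if key == "general.architecture":
                return val
    return None

# e.g. read_gguf_architecture("/content/drive/MyDrive/AntAngelMed.i1-IQ3_XXS.gguf")
# should return the "bailingmoe2" string from the traceback.
```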
Environment
- OS/Runtime: Linux (Colab)
- Python: 3.12.12
- vLLM: 0.13.0
- transformers: 4.57.3
- torch: 2.9.0+cu126
- GPU: 1× A100 80GB
- Additional packages seen in env: gguf 0.17.1
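The versions above can be re-collected in one shot with a stdlib-only snippet, in case maintainers want them reproduced exactly (a sketch; the package list mirrors the environment section):

```python
import importlib.metadata as md

def collect_versions(packages=("vllm", "transformers", "torch", "gguf")):
    """Return installed versions for the listed packages; None if absent."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = md.version(pkg)
        except md.PackageNotFoundError:
            versions[pkg] = None  # not installed in this environment
    return versions

print(collect_versions())
```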
What I’ve tried
- Changing vLLM args (`--dtype auto`, `--trust-remote-code`, etc.) does not help.
- The crash happens before model serving, during config loading / GGUF parsing.
Questions / Request for guidance
- Is the GGUF build of AntAngelMed expected to be compatible with vLLM today?
- If not, what is the recommended inference backend for this GGUF (e.g., llama.cpp/llama-server, or a specific fork / required version)?
- Is there an alternative release format (HF weights / safetensors) that works with vLLM?
- If vLLM support is planned, is there a known patch / PR / compatibility roadmap for the bailingmoe2 GGUF architecture?
Thanks! I can provide more logs or run additional debug commands if needed.