AWQ 4-bit produces repetitive gibberish on long outputs with vLLM v0.15.1 – same bug as cyankiwi variant
Summary
This model is expected to have the same gibberish-on-long-outputs bug as cyankiwi/Devstral-Small-2-24B-Instruct-2512-AWQ-4bit when served via vllm/vllm-openai:v0.15.1. The root cause is not the quantization or the group size difference (128 vs 32) – it is a chain of three bugs in vLLM v0.15.1 that result in the wrong text backbone class (MistralForCausalLM) being loaded instead of the correct Ministral3ForCausalLM. Both AWQ variants share the same config.json structure that triggers this bug.
I confirmed the bug on the cyankiwi variant (verbatim gibberish output below). I have not tested this specific model, but its config.json has the same two fields that trigger the bug:
- text_config.model_type: "ministral3" (unrecognized by transformers v4.57.6, must be patched to "mistral")
- text_config.architectures: not set (null)
These two conditions cause a Pixtral-12B special case in Mistral3ForConditionalGeneration.__init__ to fire, forcing the wrong text backbone.
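The trigger can be checked mechanically. Below is a small sketch (the helper name is mine, not vLLM code) that tests whether a config.json, after the required model_type patch, satisfies both trigger conditions:

```python
def triggers_pixtral_special_case(config: dict) -> bool:
    """Return True if a HF-format config would hit the Pixtral-12B
    special case in vLLM v0.15.1's Mistral3ForConditionalGeneration
    once model_type has been patched for transformers v4.57.6."""
    text = config.get("text_config", {})
    model_type = text.get("model_type")
    # The patch required to load under transformers v4.57.6:
    if model_type == "ministral3":
        model_type = "mistral"
    return text.get("architectures") is None and model_type == "mistral"

# Hypothetical excerpt mirroring this model's config.json:
# model_type is "ministral3", text_config.architectures is absent.
config = {"text_config": {"model_type": "ministral3"}}
print(triggers_pixtral_special_case(config))  # True: the wrong backbone gets forced
```

A config that sets text_config.architectures explicitly would not trigger the special case, which is why this failure is specific to checkpoints that leave the field null.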
Full investigation with exact code paths, config comparisons, and a survey of all alternative quantizations: GIBBERISH_BUG_REPORT.md
Verbatim gibberish output (from cyankiwi variant, same architecture)
Prompt: "Write a Python function that implements binary search on a sorted list. Include type hints and a docstring."
Here's a Python function that implements binary search on a sorted list. The function includes type hints and a docstring.
```python
from typing import TypeVar, List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
from typing import List, Tuple, Generic, Union
```
Short responses and tool calls work fine. The degeneration is 100% reproducible on any prompt requiring more than ~50-100 tokens of output.
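While debugging, a simple heuristic was enough to flag this failure mode automatically. The sketch below (my own helper, not part of vLLM) marks output as degenerate when the same non-empty line repeats several times in a row, as in the verbatim sample above:

```python
def looks_degenerate(text: str, threshold: int = 4) -> bool:
    """Flag output that repeats the same non-empty line
    `threshold` or more times consecutively."""
    run, prev = 1, None
    for line in text.splitlines():
        line = line.strip()
        if line and line == prev:
            run += 1
            if run >= threshold:
                return True
        else:
            run = 1
        prev = line
    return False

# The repeated import line from the gibberish sample trips the check:
sample = "from typing import List, Tuple, Generic, Union\n" * 7
print(looks_degenerate(sample))  # True
```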
Root cause: wrong text backbone class
In vllm/model_executor/models/mistral3.py, Mistral3ForConditionalGeneration.__init__ has this special case:
```python
# NOTE: These are special cases for Pixtral-12B in the HF-format
if (
    config.text_config.architectures is None
    and config.text_config.model_type == "mistral"
):
    config.text_config.architectures = ["MistralForCausalLM"]
```
This check fires because text_config.model_type must be patched from "ministral3" to "mistral" (transformers v4.57.6 does not recognize ministral3), and text_config.architectures is null in the model's config.json. vLLM therefore loads MistralForCausalLM (the old Mistral 7B architecture) instead of the correct Ministral3ForCausalLM, which is absent from vLLM v0.15.1's model registry.
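The effect is easy to reproduce in isolation. The sketch below mimics the config objects with SimpleNamespace (a stand-in for the real HF config classes) and applies the same check as the special case above; the values mirror this model's config.json after the model_type patch:

```python
from types import SimpleNamespace

# Stand-in for the HF config after the model_type patch
# ("ministral3" -> "mistral"); architectures is null, as shipped.
text_config = SimpleNamespace(model_type="mistral", architectures=None)
config = SimpleNamespace(text_config=text_config)

# The same condition as vLLM v0.15.1's Pixtral-12B special case:
if (
    config.text_config.architectures is None
    and config.text_config.model_type == "mistral"
):
    config.text_config.architectures = ["MistralForCausalLM"]

print(config.text_config.architectures)  # ['MistralForCausalLM'] -- the wrong backbone
```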
Why the Mistral-native loading path is also blocked
Serving with load-format "mistral" requires consolidated-*.safetensors files (Mistral-native format). This model ships only HuggingFace-format sharded safetensors (model-*.safetensors), so vLLM falls back to the HuggingFace config path and hits the bug chain above. The official FP8 model ships both formats and works via the Mistral-native path.
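To see which loading paths a local checkpoint can take, one can glob for the two filename patterns described above. The helper name below is my own, not a vLLM API:

```python
from pathlib import Path

def shipped_formats(model_dir: str) -> dict:
    """Report which weight formats a local checkpoint directory ships.
    The Mistral-native load path needs consolidated-*.safetensors;
    HF-format checkpoints ship sharded model-*.safetensors instead."""
    d = Path(model_dir)
    return {
        "mistral_native": bool(list(d.glob("consolidated-*.safetensors"))),
        "hf_sharded": bool(list(d.glob("model-*.safetensors"))),
    }
```

A directory that reports only hf_sharded is forced down the HuggingFace config path and will hit the special case described above.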
No fix available at the config level
The real fix requires vLLM to either merge the transformers v5 bump (vllm-project/vllm#30566, open since December 12, 2025) or add Ministral3ForCausalLM to its model registry.
Environment
- Docker image: vllm/vllm-openai:v0.15.1 (February 4, 2026)
- transformers inside container: 4.57.6 (pinned to >= 4.56.0, < 5)