Qwen3-VL-30B-A3B-Instruct.w4a16

Model Overview

  • Model Optimizations:
    • Weight quantization: INT4

Model Optimizations

This model was obtained by quantizing the weights of Qwen/Qwen3-VL-30B-A3B-Instruct to the INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, cutting disk size and GPU memory requirements by approximately 75%.
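
As a rough back-of-the-envelope check (illustrative arithmetic only; a real checkpoint also stores embeddings, norms, and other unquantized tensors):

# Approximate weight memory for a 30B-parameter model.
params = 30e9
bf16_gb = params * 2 / 1e9           # 16-bit weights: ~60 GB
int4_gb = params * 0.5 / 1e9         # 4-bit weights:  ~15 GB
scales_gb = params / 128 * 2 / 1e9   # one 16-bit scale per group of 128 weights: ~0.5 GB
print(f"BF16: {bf16_gb:.0f} GB, W4A16: {int4_gb + scales_gb:.1f} GB")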

Only the weights of the linear operators within transformer blocks are quantized. Weights are quantized using a symmetric per-group scheme, with group size 128. The AutoRound algorithm is applied for quantization, as implemented by the AutoRoundModifier in the llm-compressor library.
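
For intuition, here is a minimal numpy sketch of symmetric per-group weight quantization (illustrative only, not llm-compressor's internals; the tensor shape and values are made up):

import numpy as np

def quant_symmetric_per_group(w, group_size=128, bits=4):
    # One scale per group of `group_size` consecutive weights.
    qmax = 2 ** (bits - 1) - 1                       # 7 for INT4
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

w = np.random.randn(4096, 128).astype(np.float32)
q, scales = quant_symmetric_per_group(w)
w_hat = (q * scales).reshape(w.shape)                # dequantized weights
print("max abs error:", np.abs(w - w_hat).max())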

Deployment

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

vllm serve EmbeddedLLM/Qwen3-VL-30B-A3B-Instruct.w4a16
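
Once the server is running, vLLM exposes an OpenAI-compatible API on port 8000 by default. Below is a minimal sketch of a multimodal request using the openai Python client (the image URL is a placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="EmbeddedLLM/Qwen3-VL-30B-A3B-Instruct.w4a16",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},  # placeholder
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(response.choices[0].message.content)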

Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.
from auto_round.calib_dataset import get_dataset
from transformers import AutoTokenizer, Qwen3VLMoeForConditionalGeneration, AutoProcessor

from llmcompressor import oneshot
from llmcompressor.modifiers.autoround import AutoRoundModifier
from llmcompressor.utils import dispatch_for_generation

# Select model and load it.
model_id = "Qwen/Qwen3-VL-30B-A3B-Instruct"
model = Qwen3VLMoeForConditionalGeneration.from_pretrained(model_id, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)


# Select calibration dataset.
NUM_CALIBRATION_SAMPLES = 128
MAX_SEQUENCE_LENGTH = 2048
# Get aligned calibration dataset.
ds = get_dataset(
    tokenizer=tokenizer,
    seqlen=MAX_SEQUENCE_LENGTH,
    nsamples=NUM_CALIBRATION_SAMPLES,
)


# Configure the quantization algorithm to run.
#   * quantize the weights to 4 bits with AutoRound, using group size 128
recipe = AutoRoundModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=[
        "re:.*lm_head",       # keep the output head in full precision
        "re:visual.*",        # keep the vision tower in full precision
        "re:model.visual.*",
        "re:.*mlp.gate$",     # keep the MoE router gates in full precision
    ],
    iters=200,
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    # Disable shuffling to get a slightly better MMLU score.
    shuffle_calibration_samples=False,
)
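
# Optional sanity check before saving (a minimal sketch): dispatch the model
# onto the available GPUs and confirm the quantized weights still generate text.
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
sample = tokenizer("Hello my name is", return_tensors="pt").to(model.device)
output = model.generate(**sample, max_new_tokens=64)
print(tokenizer.decode(output[0]))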

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + ".w4a16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)

Evaluation

After starting the vLLM server as shown above, the model was evaluated with the following commands.

Disclaimer: Results may differ from official benchmarks due to evaluation setup variations.

lm-evaluation-harness

lm_eval \
  --model local-chat-completions \
  --model_args model="EmbeddedLLM/Qwen3-VL-30B-A3B-Instruct.w4a16",base_url=http://127.0.0.1:8000/v1/chat/completions,num_concurrent=64 \
  --tasks mmlu_pro \
  --apply_chat_template

lmms-eval

lmms-eval \
  --model async_openai \
  --model_args model_version=EmbeddedLLM/Qwen3-VL-30B-A3B-Instruct.w4a16,base_url=http://127.0.0.1:8000/v1,is_qwen3_vl=True,api_key=DUMMY,num_cpus=8 \
  --tasks mmmu_val

Accuracy

| Category | Benchmark | Qwen3-VL-30B-A3B-Instruct | Qwen3-VL-30B-A3B-Instruct.w4a16 (this model) | Recovery |
|----------|-----------|---------------------------|----------------------------------------------|----------|
| Text     | MMLU_pro  | 71.28                     | 69.55                                        | 97.6%    |
| Vision   | MMMU_val  | 51.56                     | 53.22                                        | 103.2%   |
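
Recovery is the quantized model's score as a percentage of the baseline score:

# Recovery = quantized score / baseline score
print(f"MMLU_pro: {69.55 / 71.28:.1%}")   # 97.6%
print(f"MMMU_val: {53.22 / 51.56:.1%}")   # 103.2%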