GPTQ-Int4 quantized version of Qwen2.5-Math-72B-Instruct with weight padding for vLLM tensor parallelism.

How to Run Qwen2.5-Math-72B with vLLM Tensor Parallelism: A Weight Padding Solution

TL;DR: Self-quantized Qwen2.5-72B models produce gibberish with vLLM tensor parallelism due to a dimension alignment issue. This article explains the root cause and provides a working solution using weight padding.


The Problem

If you've tried to quantize Qwen2.5-Math-72B-Instruct (or any Qwen2.5-72B variant) using GPTQ and serve it with vLLM using tensor_parallel_size=2, you may have encountered this frustrating behavior:

# Your vLLM command
vllm serve ./my-quantized-model --tensor-parallel-size 2 --quantization gptq_marlin

# Expected output
"The answer is 4"

# Actual output
"!!!!!!!!!!!!!!!!"

The model loads successfully, shows no errors, but produces complete gibberish. Yet the same model works perfectly with tensor_parallel_size=1.

Root Cause

After deep investigation, we traced this to a mathematical constraint in vLLM's Marlin kernel:

intermediate_size must be divisible by (group_size × tensor_parallel_size)

For Qwen2.5-72B models:

  • intermediate_size = 29568
  • group_size (GPTQ default) = 128
  • tensor_parallel_size = 2

The math:

29568 ÷ 128 = 231 groups
231 ÷ 2 = 115.5  ← Not an integer!

When vLLM attempts to split these 231 groups across 2 GPUs, it can't divide evenly. The Marlin kernel doesn't raise an error—it simply executes with misaligned memory accesses, producing corrupted output.
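The constraint is easy to check before you quantize. A minimal sketch (the helper `marlin_tp_compatible` is hypothetical, not part of vLLM):

```python
def marlin_tp_compatible(intermediate_size: int, group_size: int, tp_size: int) -> bool:
    """True if the quantization groups split evenly across tensor-parallel ranks."""
    return intermediate_size % (group_size * tp_size) == 0

print(marlin_tp_compatible(29568, 128, 1))  # True  -> works with TP=1
print(marlin_tp_compatible(29568, 128, 2))  # False -> gibberish with TP=2
print(marlin_tp_compatible(29696, 128, 2))  # True  -> padded size works
```

Running this against your model's config.json before quantizing can save hours of wasted GPU time.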

Why Official Qwen GPTQ Models Work

If you've used Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 from Hugging Face, you may have noticed it works fine with tensor parallelism. That's because the official models have intermediate_size=29696, not 29568.

The Qwen team pads the weights before quantization:

29696 ÷ 128 = 232 groups
232 ÷ 2 = 116  ← Integer, works!

The Solution: Weight Padding

We need to pad the MLP weights to change intermediate_size from 29568 to 29696. The Qwen documentation provides an interleaving pattern that preserves mathematical equivalence.

Step 1: Pad the Weights

import torch
from torch.nn import functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
import json
import os

MODEL_ID = "Qwen/Qwen2.5-Math-72B-Instruct"
OUTPUT_DIR = "./Qwen2.5-Math-72B-Instruct-Padded"

# Load model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="cpu",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Pad weights
pad_size = 128
sd = model.state_dict()

for k in sd:
    v = sd[k]
    if ('mlp.up_proj.weight' in k) or ('mlp.gate_proj.weight' in k):
        # Interleaving pattern for [29568, 8192] -> [29696, 8192]
        prev_v = F.pad(v.unsqueeze(1), (0, 0, 0, 1, 0, 0)).reshape(29568*2, -1)[:pad_size*2]
        sd[k] = torch.cat([prev_v, v[pad_size:]], dim=0)
    elif 'mlp.down_proj.weight' in k:
        # Interleaving pattern for [8192, 29568] -> [8192, 29696]
        prev_v = F.pad(v.unsqueeze(2), (0, 1)).reshape(8192, 29568*2)[:, :pad_size*2]
        sd[k] = torch.cat([prev_v, v[:, pad_size:]], dim=1)

# Save
os.makedirs(OUTPUT_DIR, exist_ok=True)
torch.save(sd, f"{OUTPUT_DIR}/pytorch_model.bin")
tokenizer.save_pretrained(OUTPUT_DIR)

# Update config
config = model.config.to_dict()
config["intermediate_size"] = 29696
with open(f"{OUTPUT_DIR}/config.json", "w") as f:
    json.dump(config, f, indent=2)

print("Padding complete!")

Step 2: Quantize with GPTQModel

from gptqmodel import GPTQModel, QuantizeConfig
from transformers import AutoTokenizer
from datasets import load_dataset

PADDED_DIR = "./Qwen2.5-Math-72B-Instruct-Padded"
OUTPUT_DIR = "./Qwen2.5-Math-72B-Instruct-GPTQ-Int4-TP2"

tokenizer = AutoTokenizer.from_pretrained(PADDED_DIR)

# Prepare calibration data
calibration_data = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
).select(range(512))

calibration_dataset = []
for s in calibration_data:
    # Tokenize each sample once and reuse both tensors
    enc = tokenizer(s["text"], truncation=True, max_length=2048, return_tensors="pt")
    calibration_dataset.append(
        {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]}
    )

# Quantize (Marlin-compatible settings)
quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,  # Required for Marlin
    sym=True,        # Required for Marlin
)

model = GPTQModel.load(PADDED_DIR, quant_config, trust_remote_code=True)
model.quantize(calibration_dataset)
model.save(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

print("Quantization complete!")

Step 3: Serve with vLLM

CUDA_VISIBLE_DEVICES=0,1 vllm serve ./Qwen2.5-Math-72B-Instruct-GPTQ-Int4-TP2 \
    --served-model-name Qwen2.5-Math-72B-Instruct-GPTQ-Int4-TP2 \
    --tensor-parallel-size 2 \
    --quantization gptq_marlin \
    --gpu-memory-utilization 0.90 \
    --max-model-len 4096 \
    --dtype float16 \
    --trust-remote-code \
    --port 8000

Step 4: Test

curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen2.5-Math-72B-Instruct-GPTQ-Int4-TP2",
        "prompt": "What is the derivative of x^2?",
        "max_tokens": 100
    }'

You should now see coherent mathematical responses instead of gibberish!

Why the Interleaving Pattern?

You might wonder why we use this complex interleaving instead of simply appending zeros at the end. The pattern ensures:

  1. Mathematical equivalence: Zero rows in up_proj/gate_proj produce zero activations, which multiply with zero columns in down_proj. The net contribution is 0×0=0.

  2. Quantization compatibility: The interleaving distributes the padding in a way that's compatible with GPTQ's group-wise quantization.

We verified equivalence with random tensors:

Max absolute difference: 1.31e-06 (floating point precision)
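The equivalence check is easy to reproduce at toy scale. The sketch below uses small dimensions (instead of the real 29568 × 8192) and hypothetical helper names; it applies the same interleaving transforms from Step 1 to a random SwiGLU MLP and compares outputs:

```python
import torch
import torch.nn.functional as F

def pad_interleave_rows(w: torch.Tensor, pad: int) -> torch.Tensor:
    """[inter, hidden] -> [inter+pad, hidden]: zero rows interleaved into the first `pad` rows."""
    inter = w.shape[0]
    head = F.pad(w.unsqueeze(1), (0, 0, 0, 1)).reshape(inter * 2, -1)[: pad * 2]
    return torch.cat([head, w[pad:]], dim=0)

def pad_interleave_cols(w: torch.Tensor, pad: int) -> torch.Tensor:
    """[hidden, inter] -> [hidden, inter+pad]: zero columns interleaved into the first `pad` cols."""
    hidden, inter = w.shape
    head = F.pad(w.unsqueeze(2), (0, 1)).reshape(hidden, inter * 2)[:, : pad * 2]
    return torch.cat([head, w[:, pad:]], dim=1)

def swiglu_mlp(x, gate, up, down):
    # Qwen2.5 MLP: down_proj(silu(gate_proj(x)) * up_proj(x))
    return (F.silu(x @ gate.T) * (x @ up.T)) @ down.T

torch.manual_seed(0)
hidden, inter, pad = 8, 16, 4
x = torch.randn(3, hidden)
gate, up = torch.randn(inter, hidden), torch.randn(inter, hidden)
down = torch.randn(hidden, inter)

ref = swiglu_mlp(x, gate, up, down)
padded = swiglu_mlp(
    x,
    pad_interleave_rows(gate, pad),
    pad_interleave_rows(up, pad),
    pad_interleave_cols(down, pad),
)
print(f"max |diff| = {(ref - padded).abs().max().item():.2e}")
```

Because gate/up rows and down columns are interleaved with the same pattern, every zero channel in the intermediate dimension pairs with a zero column in down_proj, so the outputs match up to floating-point reordering.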

Performance Benchmarks

Tested on 2× NVIDIA A40 (48GB each) with 4 concurrent requests:

Metric                   Value
Generation throughput    ~95-99 tokens/s
Prompt throughput        ~50-180 tokens/s
GPU KV cache usage       2.7%-6.4%
Prefix cache hit rate    ~8%

This is production-ready performance for a 72B parameter model!

Mathematical Accuracy Validation

We tested the padded + quantized model across 10 diverse mathematical domains:

#   Topic                    Problem                              Result
1   Calculus - Integration   ∫(x³ + 2x² - 5x + 3)dx               ✓ Power rule correct
2   Calculus - Derivative    dy/dx of ln(sin(x²))                 ✓ Chain rule (nested) correct
3   Linear Algebra           Eigenvalues of [[4,2],[1,3]]         ✓ λ=2, λ=5 correct
4   Probability              P(2 red balls without replacement)   ✓ Combination formula correct
5   Number Theory            GCD(252, 105) via Euclidean          ✓ Algorithm correct
6   Optimization             Minimum of x² - 6x + 11              ✓ Completing square correct
7   Series                   Sum of 1 + 1/2 + 1/4 + ...           ✓ Geometric series correct
8   Differential Equations   dy/dx = 2xy, y(0)=1                  ✓ Separation of variables correct
9   Statistics               Std dev of [12,15,18,22,25]          ✓ Variance calculation correct
10  Complex Analysis         (3+4i)/(1-2i) in a+bi form           ✓ Conjugate method correct

Result: 10/10 correct — The padding preserves full mathematical reasoning capability.

Hardware Requirements

  • GPUs: 2× NVIDIA A40 (48GB) or similar
  • RAM: ~160GB for loading the 72B model
  • Disk: ~300GB (base model + padded + quantized)

Compatibility

This solution works for:

  • tensor_parallel_size=2 (232 ÷ 2 = 116)
  • tensor_parallel_size=4 (232 ÷ 4 = 58)
  • tensor_parallel_size=8 (232 ÷ 8 = 29)
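The figures in the list above follow directly from the padded group count:

```python
# 29696 / 128 = 232 quantization groups after padding
groups = 29696 // 128
for tp in (2, 4, 8):
    print(f"tp={tp}: even split = {groups % tp == 0} ({groups // tp} groups per GPU)")
```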

Key Takeaways

  1. The "!!!!" output is not random—it's a symptom of dimension misalignment in GPU kernels.

  2. Official Qwen GPTQ models already have this fix—they ship with intermediate_size=29696.

  3. Self-quantized models need padding—if you're quantizing from the base model, apply padding first.

  4. This affects all Qwen2.5-72B variants—Math, Coder, base, and Instruct versions all have intermediate_size=29568.

License & Attribution

This work is built with Qwen. The base model Qwen2.5-Math-72B-Instruct is subject to the Qwen LICENSE AGREEMENT.

Qwen is licensed under the Qwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.

Note: Commercial use with >100 million monthly active users requires explicit authorization from Alibaba Cloud.

Built with Qwen. Tested with: Python 3.12, PyTorch 2.9.1+cu128, vLLM 0.13.0, GPTQModel 5.6.12 on 2× NVIDIA A40 GPUs.
