GPTQ-Int4 quantized version of Qwen2.5-Math-72B-Instruct with weight padding for vLLM tensor parallelism.
How to Run Qwen2.5-Math-72B with vLLM Tensor Parallelism: A Weight Padding Solution
TL;DR: Self-quantized Qwen2.5-72B models produce gibberish with vLLM tensor parallelism due to a dimension alignment issue. This article explains the root cause and provides a working solution using weight padding.
The Problem
If you've tried to quantize Qwen2.5-Math-72B-Instruct (or any Qwen2.5-72B variant) using GPTQ and serve it with vLLM using tensor_parallel_size=2, you may have encountered this frustrating behavior:
```shell
# Your vLLM command
vllm serve ./my-quantized-model --tensor-parallel-size 2 --quantization gptq_marlin

# Expected output
"The answer is 4"

# Actual output
"!!!!!!!!!!!!!!!!"
```
The model loads successfully and reports no errors, yet produces complete gibberish. The very same model works perfectly with tensor_parallel_size=1.
Root Cause
After deep investigation, we traced this to a mathematical constraint in vLLM's Marlin kernel:
intermediate_size must be divisible by (group_size × tensor_parallel_size)
For Qwen2.5-72B models:
- intermediate_size = 29568
- group_size = 128 (GPTQ default)
- tensor_parallel_size = 2
The math:
29568 ÷ 128 = 231 groups
231 ÷ 2 = 115.5 ← Not an integer!
When vLLM attempts to split these 231 groups across 2 GPUs, it can't divide evenly. The Marlin kernel doesn't raise an error—it simply executes with misaligned memory accesses, producing corrupted output.
Why Official Qwen GPTQ Models Work
If you've used Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 from Hugging Face, you may have noticed it works fine with tensor parallelism. That's because the official models have intermediate_size=29696, not 29568.
The Qwen team pads the weights before quantization:
29696 ÷ 128 = 232 groups
232 ÷ 2 = 116 ← Integer, works!
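You can run this divisibility check yourself before spending GPU-hours on quantization. A minimal sketch; `marlin_tp_ok` is a helper name of ours, not a vLLM API:

```python
def marlin_tp_ok(intermediate_size: int, group_size: int = 128, tp: int = 2) -> bool:
    """True if the GPTQ groups along intermediate_size split evenly across TP ranks."""
    if intermediate_size % group_size != 0:
        return False
    groups = intermediate_size // group_size
    return groups % tp == 0

print(marlin_tp_ok(29568))  # False: 231 groups cannot split across 2 GPUs
print(marlin_tp_ok(29696))  # True: 232 groups -> 116 per GPU
```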
The Solution: Weight Padding
We need to pad the MLP weights to change intermediate_size from 29568 to 29696. The Qwen documentation provides an interleaving pattern that preserves mathematical equivalence.
Step 1: Pad the Weights
```python
import json
import os

import torch
from torch.nn import functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-Math-72B-Instruct"
OUTPUT_DIR = "./Qwen2.5-Math-72B-Instruct-Padded"

# Load the full-precision model on CPU
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="cpu",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Pad the MLP weights: intermediate_size 29568 -> 29696 (+128)
pad_size = 128
sd = model.state_dict()
for k in sd:
    v = sd[k]
    if ('mlp.up_proj.weight' in k) or ('mlp.gate_proj.weight' in k):
        # Interleaving pattern for [29568, 8192] -> [29696, 8192]:
        # each of the first 128 rows is followed by an all-zero row
        prev_v = F.pad(v.unsqueeze(1), (0, 0, 0, 1, 0, 0)).reshape(29568 * 2, -1)[:pad_size * 2]
        sd[k] = torch.cat([prev_v, v[pad_size:]], dim=0)
    elif 'mlp.down_proj.weight' in k:
        # Interleaving pattern for [8192, 29568] -> [8192, 29696]:
        # each of the first 128 columns is followed by an all-zero column
        prev_v = F.pad(v.unsqueeze(2), (0, 1)).reshape(8192, 29568 * 2)[:, :pad_size * 2]
        sd[k] = torch.cat([prev_v, v[:, pad_size:]], dim=1)

# Save padded weights and tokenizer
os.makedirs(OUTPUT_DIR, exist_ok=True)
torch.save(sd, f"{OUTPUT_DIR}/pytorch_model.bin")
tokenizer.save_pretrained(OUTPUT_DIR)

# Update config to reflect the new intermediate_size
config = model.config.to_dict()
config["intermediate_size"] = 29696
with open(f"{OUTPUT_DIR}/config.json", "w") as f:
    json.dump(config, f, indent=2)

print("Padding complete!")
```
Step 2: Quantize with GPTQModel
```python
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
from transformers import AutoTokenizer

PADDED_DIR = "./Qwen2.5-Math-72B-Instruct-Padded"
OUTPUT_DIR = "./Qwen2.5-Math-72B-Instruct-GPTQ-Int4-TP2"

tokenizer = AutoTokenizer.from_pretrained(PADDED_DIR)

# Prepare calibration data: 512 C4 samples, tokenized once each
calibration_data = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(512))

calibration_dataset = []
for s in calibration_data:
    enc = tokenizer(s["text"], truncation=True, max_length=2048, return_tensors="pt")
    calibration_dataset.append(
        {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]}
    )

# Quantize (Marlin-compatible settings)
quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,  # Required for Marlin
    sym=True,        # Required for Marlin
)

model = GPTQModel.load(PADDED_DIR, quant_config, trust_remote_code=True)
model.quantize(calibration_dataset)
model.save(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

print("Quantization complete!")
```
Step 3: Serve with vLLM
```shell
CUDA_VISIBLE_DEVICES=0,1 vllm serve ./Qwen2.5-Math-72B-Instruct-GPTQ-Int4-TP2 \
  --tensor-parallel-size 2 \
  --quantization gptq_marlin \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --dtype float16 \
  --trust-remote-code \
  --port 8000
```
Step 4: Test
```shell
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "./Qwen2.5-Math-72B-Instruct-GPTQ-Int4-TP2",
    "prompt": "What is the derivative of x^2?",
    "max_tokens": 100
  }'
```

Note that vLLM registers the model under the exact path passed to `vllm serve`; pass `--served-model-name` if you prefer a cleaner name.
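The same request can be issued from Python using only the standard library. A sketch that assumes the Step 3 server is listening on localhost:8000 and prints a notice if it is not:

```python
import json
import urllib.request

payload = {
    # must match the model name vLLM serves (by default, the path given to `vllm serve`)
    "model": "./Qwen2.5-Math-72B-Instruct-GPTQ-Int4-TP2",
    "prompt": "What is the derivative of x^2?",
    "max_tokens": 100,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(json.load(resp)["choices"][0]["text"])
except OSError as e:
    print(f"Server not reachable: {e}")
```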
You should now see coherent mathematical responses instead of gibberish!
Why the Interleaving Pattern?
You might wonder why we use this complex interleaving instead of simply appending zeros at the end. The pattern ensures:
- Mathematical equivalence: zero rows in up_proj/gate_proj produce zero activations, which multiply with zero columns in down_proj. The net contribution is 0 × 0 = 0.
- Quantization compatibility: the interleaving distributes the padding in a way that is compatible with GPTQ's group-wise quantization.
We verified equivalence with random tensors: the maximum absolute difference was 1.31e-06 (floating-point precision).
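That verification can be reproduced at toy scale. The sketch below applies the same interleaving pattern as Step 1, but on small random weights (hidden=4, intermediate=8, pad=2 standing in for 8192/29568/128) inside a SwiGLU MLP like Qwen2.5's; the helper names are ours:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden, inter, pad = 4, 8, 2  # toy stand-ins for 8192, 29568, 128

gate = torch.randn(inter, hidden)
up = torch.randn(inter, hidden)
down = torch.randn(hidden, inter)
x = torch.randn(3, hidden)

def pad_rows(v):
    # interleave a zero row after each of the first `pad` rows
    head = F.pad(v.unsqueeze(1), (0, 0, 0, 1))[:pad].reshape(pad * 2, -1)
    return torch.cat([head, v[pad:]], dim=0)

def pad_cols(v):
    # interleave a zero column after each of the first `pad` columns
    head = F.pad(v.unsqueeze(2), (0, 1))[:, :pad].reshape(v.shape[0], pad * 2)
    return torch.cat([head, v[:, pad:]], dim=1)

def mlp(x, g, u, d):
    # SwiGLU: down_proj(silu(gate_proj(x)) * up_proj(x))
    return (F.silu(x @ g.T) * (x @ u.T)) @ d.T

ref = mlp(x, gate, up, down)
padded = mlp(x, pad_rows(gate), pad_rows(up), pad_cols(down))
print((ref - padded).abs().max())  # tiny: padded positions contribute silu(0) * 0 * 0
```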
Performance Benchmarks
Tested on 2× NVIDIA A40 (48GB each) with 4 concurrent requests:
| Metric | Value |
|---|---|
| Generation throughput | ~95-99 tokens/s |
| Prompt throughput | ~50-180 tokens/s |
| GPU KV cache usage | 2.7%-6.4% |
| Prefix cache hit rate | ~8% |
This is production-ready performance for a 72B parameter model!
Mathematical Accuracy Validation
We tested the padded + quantized model across 10 diverse mathematical domains:
| # | Topic | Problem | Result |
|---|---|---|---|
| 1 | Calculus - Integration | ∫(x³ + 2x² - 5x + 3)dx | ✓ Power rule correct |
| 2 | Calculus - Derivative | dy/dx of ln(sin(x²)) | ✓ Chain rule (nested) correct |
| 3 | Linear Algebra | Eigenvalues of [[4,2],[1,3]] | ✓ λ=2, λ=5 correct |
| 4 | Probability | P(2 red balls without replacement) | ✓ Combination formula correct |
| 5 | Number Theory | GCD(252, 105) via Euclidean | ✓ Algorithm correct |
| 6 | Optimization | Minimum of x² - 6x + 11 | ✓ Completing square correct |
| 7 | Series | Sum of 1 + 1/2 + 1/4 + ... | ✓ Geometric series correct |
| 8 | Differential Equations | dy/dx = 2xy, y(0)=1 | ✓ Separation of variables correct |
| 9 | Statistics | Std dev of [12,15,18,22,25] | ✓ Variance calculation correct |
| 10 | Complex Analysis | (3+4i)/(1-2i) in a+bi form | ✓ Conjugate method correct |
Result: 10/10 correct — The padding preserves full mathematical reasoning capability.
Hardware Requirements
- GPUs: 2× NVIDIA A40 (48GB) or similar
- RAM: ~160GB for loading the 72B model
- Disk: ~300GB (base model + padded + quantized)
Compatibility
This solution works for:
- tensor_parallel_size=2 (232 ÷ 2 = 116)
- tensor_parallel_size=4 (232 ÷ 4 = 58)
- tensor_parallel_size=8 (232 ÷ 8 = 29)
Key Takeaways
- The "!!!!" output is not random: it's a symptom of dimension misalignment in GPU kernels.
- Official Qwen GPTQ models already have this fix: they ship with intermediate_size=29696.
- Self-quantized models need padding: if you're quantizing from the base model, apply padding first.
- This affects all Qwen2.5-72B variants: Math, Coder, base, and Instruct versions all have intermediate_size=29568.
License & Attribution
This work is built with Qwen. The base model Qwen2.5-Math-72B-Instruct is subject to the Qwen LICENSE AGREEMENT.
Qwen is licensed under the Qwen LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
Note: Commercial use with >100 million monthly active users requires explicit authorization from Alibaba Cloud.
References
Built with Qwen. Tested with: Python 3.12, PyTorch 2.9.1+cu128, vLLM 0.13.0, GPTQModel 5.6.12 on 2× NVIDIA A40 GPUs.