Qwen3.5-9B-GPTQ-INT8

This model is a GPTQ-quantized version of Qwen/Qwen3.5-9B with a normalized text-only config.json.

Quantization

  • Method: GPTQ
  • Bits: 8
  • Group size: 128
  • desc_act: False
  • damp_percent: 0.1
  • Calibration preset: math_qa_cot
  • Calibration dataset: zwhe99/DeepMath-103K split train
  • Max calibration samples: 128
  • Max sequence length: 16384
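The settings above correspond to a quantization_config block along these lines in the exported config.json. This is an illustrative sketch using the common AutoGPTQ/transformers key names; the exact layout in the checkpoint may differ.

```python
# Sketch of the quantization settings above as a quantization_config
# dict (key names follow the usual GPTQ convention; verify against
# the actual exported config.json).
quantization_config = {
    "quant_method": "gptq",
    "bits": 8,
    "group_size": 128,
    "desc_act": False,
    "damp_percent": 0.1,
}

print(quantization_config)
```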

Reproduction

uv run python quantization/quantize_qwen35_9b_gptq.py \
  --model-name Qwen/Qwen3.5-9B \
  --output-dir /workspace/lowbit-math-reasoning/experiments/models/Qwen3.5-9B-GPTQ-INT8 \
  --dataset-name zwhe99/DeepMath-103K \
  --dataset-config '' \
  --dataset-split train \
  --calibration-preset math_qa_cot \
  --question-column question \
  --answer-column r1_solution_1 \
  --text-column r1_solution_1 \
  --max-calibration-samples 128 \
  --max-seq-len 16384 \
  --bits 8 \
  --group-size 128 \
  --damp-percent 0.1

The current quantization script rewrites config.json after save_pretrained() so the exported checkpoint uses the same text-only qwen3_5_text layout as the working INT4 checkpoint.
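A minimal sketch of that post-save rewrite step. It only pins model_type to the text-only value; the actual script may also move or drop other keys, so treat this as an assumption about its behavior, not a copy of it.

```python
import json
from pathlib import Path


def normalize_config(output_dir: str) -> None:
    """Rewrite config.json in place to the text-only layout.

    Illustrative sketch: the real quantization script may perform
    additional normalization beyond setting model_type.
    """
    path = Path(output_dir) / "config.json"
    config = json.loads(path.read_text())
    config["model_type"] = "qwen3_5_text"
    path.write_text(json.dumps(config, indent=2) + "\n")
```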

Validation

This normalized-config checkpoint was re-evaluated on GSM8K: it matched the original INT8 accuracy (EM 0.96) while improving throughput by roughly 42%.

  • Original INT8: EM 0.96, 105.98 tok/s
  • Fixed-config INT8: EM 0.96, 150.84 tok/s
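The throughput gain works out as follows, using the GSM8K numbers reported above:

```python
# Speedup of the fixed-config INT8 checkpoint over the original,
# from the reported GSM8K decode throughput.
original = 105.98  # tok/s, original INT8
fixed = 150.84     # tok/s, fixed-config INT8

speedup = fixed / original
print(f"{speedup:.2f}x")  # ~1.42x
```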

Notes

  • This repository contains quantized weights only.
  • The checkpoint is intended for text-only evaluation.
  • vLLM loads this checkpoint with its gptq_marlin quantization backend.