DeepSeek-OCR-2 — GPTQ INT8 (g32)

INT8 GPTQ quantization of deepseek-ai/DeepSeek-OCR-2 for vLLM inference. Produced with Intel AutoRound using data-free RTN.

What's quantized

  • Quantized (INT8 GPTQ, group_size=32, sym, desc_act=False): every linear in the DeepSeek-V2 decoder MLP and MoE experts — 2148 layers total.
  • Left in fp16: decoder self-attention (q/k/v/o_proj), vision encoders (sam_model, qwen2_model), embed_tokens, lm_head, projector, all layernorms.

Attention and vision stay fp16 because vLLM's deepseek_ocr2.py builds those modules with plain torch.nn.Linear (not the quant-aware parallel linears), so they can't carry GPTQ tensor suffixes. Only the deepseek_v2 decoder is wrapped in quant-aware linears in vLLM's current implementation.

Size on disk 4.49 GB
Quantized linears 2148
group_size 32 (uniform)
Tensor parallel works at TP=1 and TP=2
dtype fp16 + int32 (no bf16)

Running with vLLM

Tested with vLLM v0.16.0 (CUDA 13). No source patches required.

vllm serve <path-to-this-checkpoint> \
  --served-model-name deepseek-ai/DeepSeek-OCR-2 \
  --dtype half \
  --quantization gptq \
  --gpu-memory-utilization 0.7 \
  --max-num-seqs 16 \
  --logits_processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor \
  --no-enable-prefix-caching \
  --mm-processor-cache-gb 0

Add --tensor-parallel-size 2 if you have two GPUs; the g32 layout aligns cleanly for TP=2 on every quantized layer.

Flag notes:

  • --logits_processors ...NGramPerReqLogitsProcessor is the OCR-specific decoding helper that ships in vLLM's deepseek_ocr module.
  • --no-enable-prefix-caching and --mm-processor-cache-gb 0 are appropriate for OCR (every image is unique; prefix and mm caches just waste memory).
  • No trust_remote_code needed (and shouldn't be passed). vLLM uses its in-tree DeepseekOCR2ForCausalLM class. The original HF auto_map was removed from config.json to prevent transformers from demanding it.

Other notes

  • model_type is deepseek_vl_v2 (matching the original HF config); this is what vLLM's registry keys on.
  • No g_idx in the checkpoint: desc_act=False means vLLM doesn't create the g_idx parameter, and shipping one breaks loading.
  • Group size 32 chosen over 64/128 for slightly better quantization fidelity (smaller groups → tighter per-group dynamic range). The size cost is negligible (~150 MB of extra scales on a 4.5 GB file).

Quantization recipe (summary)

from auto_round import AutoRound
# scheme="W8A16", group_size=32, sym=True, iters=0, disable_opt_rtn=True
# skip via layer_config: model.layers.{i}.self_attn.{q,k,v,o}_proj  (bits=16)
ar.save_quantized(out_dir, format="auto_gptq")

Post-processing applied: auto_map stripped from config, model_type reset to deepseek_vl_v2, modules_to_not_convert populated for attention, all bf16/f32 tensors cast to fp16, all g_idx tensors dropped, single safetensors file.

License

Inherits the original DeepSeek-OCR-2 license.

Acknowledgements

Downloads last month
332
Safetensors
Model size
3B params
Tensor type
I32
·
F16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for bartsolutions/DeepSeek-OCR-2-GPTQ-INT8

Quantized
(4)
this model