Instructions to use bartsolutions/DeepSeek-OCR-2-GPTQ-INT8 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use bartsolutions/DeepSeek-OCR-2-GPTQ-INT8 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("image-text-to-text", model="bartsolutions/DeepSeek-OCR-2-GPTQ-INT8")# Load model directly from transformers import AutoModelForCausalLM model = AutoModelForCausalLM.from_pretrained("bartsolutions/DeepSeek-OCR-2-GPTQ-INT8", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use bartsolutions/DeepSeek-OCR-2-GPTQ-INT8 with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "bartsolutions/DeepSeek-OCR-2-GPTQ-INT8" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "bartsolutions/DeepSeek-OCR-2-GPTQ-INT8", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/bartsolutions/DeepSeek-OCR-2-GPTQ-INT8
- SGLang
How to use bartsolutions/DeepSeek-OCR-2-GPTQ-INT8 with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "bartsolutions/DeepSeek-OCR-2-GPTQ-INT8" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "bartsolutions/DeepSeek-OCR-2-GPTQ-INT8", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "bartsolutions/DeepSeek-OCR-2-GPTQ-INT8" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "bartsolutions/DeepSeek-OCR-2-GPTQ-INT8", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use bartsolutions/DeepSeek-OCR-2-GPTQ-INT8 with Docker Model Runner:
docker model run hf.co/bartsolutions/DeepSeek-OCR-2-GPTQ-INT8
DeepSeek-OCR-2 — GPTQ INT8 (g32)
INT8 GPTQ quantization of deepseek-ai/DeepSeek-OCR-2
for vLLM inference. Produced with Intel AutoRound
using data-free RTN.
What's quantized
- Quantized (INT8 GPTQ, group_size=32, sym, desc_act=False): every linear in the DeepSeek-V2 decoder MLP and MoE experts — 2148 layers total.
- Left in fp16: decoder self-attention (q/k/v/o_proj), vision encoders
(
sam_model,qwen2_model),embed_tokens,lm_head,projector, all layernorms.
Attention and vision stay fp16 because vLLM's deepseek_ocr2.py builds those modules
with plain torch.nn.Linear (not the quant-aware parallel linears), so they can't
carry GPTQ tensor suffixes. Only the deepseek_v2 decoder is wrapped in quant-aware
linears in vLLM's current implementation.
| Size on disk | 4.49 GB |
| Quantized linears | 2148 |
| group_size | 32 (uniform) |
| Tensor parallel | works at TP=1 and TP=2 |
| dtype | fp16 + int32 (no bf16) |
Running with vLLM
Tested with vLLM v0.16.0 (CUDA 13). No source patches required.
vllm serve <path-to-this-checkpoint> \
--served-model-name deepseek-ai/DeepSeek-OCR-2 \
--dtype half \
--quantization gptq \
--gpu-memory-utilization 0.7 \
--max-num-seqs 16 \
--logits_processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor \
--no-enable-prefix-caching \
--mm-processor-cache-gb 0
Add --tensor-parallel-size 2 if you have two GPUs; the g32 layout aligns cleanly
for TP=2 on every quantized layer.
Flag notes:
--logits_processors ...NGramPerReqLogitsProcessoris the OCR-specific decoding helper that ships in vLLM'sdeepseek_ocrmodule.--no-enable-prefix-cachingand--mm-processor-cache-gb 0are appropriate for OCR (every image is unique; prefix and mm caches just waste memory).- No
trust_remote_codeneeded (and shouldn't be passed). vLLM uses its in-treeDeepseekOCR2ForCausalLMclass. The original HFauto_mapwas removed fromconfig.jsonto prevent transformers from demanding it.
Other notes
model_typeisdeepseek_vl_v2(matching the original HF config); this is what vLLM's registry keys on.- No
g_idxin the checkpoint:desc_act=Falsemeans vLLM doesn't create theg_idxparameter, and shipping one breaks loading. - Group size 32 chosen over 64/128 for slightly better quantization fidelity (smaller groups → tighter per-group dynamic range). The size cost is negligible (~150 MB of extra scales on a 4.5 GB file).
Quantization recipe (summary)
from auto_round import AutoRound
# scheme="W8A16", group_size=32, sym=True, iters=0, disable_opt_rtn=True
# skip via layer_config: model.layers.{i}.self_attn.{q,k,v,o}_proj (bits=16)
ar.save_quantized(out_dir, format="auto_gptq")
Post-processing applied: auto_map stripped from config, model_type reset to
deepseek_vl_v2, modules_to_not_convert populated for attention, all bf16/f32
tensors cast to fp16, all g_idx tensors dropped, single safetensors file.
License
Inherits the original DeepSeek-OCR-2 license.
Acknowledgements
- Original model: DeepSeek-AI
- Quantization tool: Intel AutoRound
- Inference engine: vLLM
- Downloads last month
- 332
Model tree for bartsolutions/DeepSeek-OCR-2-GPTQ-INT8
Base model
deepseek-ai/DeepSeek-OCR-2