Instructions to use bartsolutions/DeepSeek-OCR-2-GPTQ-INT8 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use bartsolutions/DeepSeek-OCR-2-GPTQ-INT8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="bartsolutions/DeepSeek-OCR-2-GPTQ-INT8")

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("bartsolutions/DeepSeek-OCR-2-GPTQ-INT8", dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use bartsolutions/DeepSeek-OCR-2-GPTQ-INT8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "bartsolutions/DeepSeek-OCR-2-GPTQ-INT8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "bartsolutions/DeepSeek-OCR-2-GPTQ-INT8",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/bartsolutions/DeepSeek-OCR-2-GPTQ-INT8

SGLang

How to use bartsolutions/DeepSeek-OCR-2-GPTQ-INT8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "bartsolutions/DeepSeek-OCR-2-GPTQ-INT8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "bartsolutions/DeepSeek-OCR-2-GPTQ-INT8",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "bartsolutions/DeepSeek-OCR-2-GPTQ-INT8" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "bartsolutions/DeepSeek-OCR-2-GPTQ-INT8",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use bartsolutions/DeepSeek-OCR-2-GPTQ-INT8 with Docker Model Runner:
```
docker model run hf.co/bartsolutions/DeepSeek-OCR-2-GPTQ-INT8
```

DeepSeek-OCR-2 — GPTQ INT8 (g32)

INT8 GPTQ quantization of deepseek-ai/DeepSeek-OCR-2 for vLLM inference. Produced with Intel AutoRound using data-free RTN.

What's quantized

Quantized (INT8 GPTQ, group_size=32, sym, desc_act=False): every linear in the DeepSeek-V2 decoder MLP and MoE experts — 2148 layers total.
Left in fp16: decoder self-attention (q/k/v/o_proj), vision encoders (sam_model, qwen2_model), embed_tokens, lm_head, projector, all layernorms.

Attention and vision stay fp16 because vLLM's deepseek_ocr2.py builds those modules with plain torch.nn.Linear (not the quant-aware parallel linears), so they can't carry GPTQ tensor suffixes. Only the deepseek_v2 decoder is wrapped in quant-aware linears in vLLM's current implementation.


Size on disk	4.49 GB
Quantized linears	2148
group_size	32 (uniform)
Tensor parallel	works at TP=1 and TP=2
dtype	fp16 + int32 (no bf16)

Running with vLLM

Tested with vLLM v0.16.0 (CUDA 13). No source patches required.

vllm serve <path-to-this-checkpoint> \
  --served-model-name deepseek-ai/DeepSeek-OCR-2 \
  --dtype half \
  --quantization gptq \
  --gpu-memory-utilization 0.7 \
  --max-num-seqs 16 \
  --logits_processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor \
  --no-enable-prefix-caching \
  --mm-processor-cache-gb 0

Add --tensor-parallel-size 2 if you have two GPUs; the g32 layout aligns cleanly for TP=2 on every quantized layer.

Flag notes:

--logits_processors ...NGramPerReqLogitsProcessor is the OCR-specific decoding helper that ships in vLLM's deepseek_ocr module.
--no-enable-prefix-caching and --mm-processor-cache-gb 0 are appropriate for OCR (every image is unique; prefix and mm caches just waste memory).
No trust_remote_code needed (and shouldn't be passed). vLLM uses its in-tree DeepseekOCR2ForCausalLM class. The original HF auto_map was removed from config.json to prevent transformers from demanding it.

Other notes

model_type is deepseek_vl_v2 (matching the original HF config); this is what vLLM's registry keys on.
No g_idx in the checkpoint: desc_act=False means vLLM doesn't create the g_idx parameter, and shipping one breaks loading.
Group size 32 chosen over 64/128 for slightly better quantization fidelity (smaller groups → tighter per-group dynamic range). The size cost is negligible (~150 MB of extra scales on a 4.5 GB file).

Quantization recipe (summary)

from auto_round import AutoRound
# scheme="W8A16", group_size=32, sym=True, iters=0, disable_opt_rtn=True
# skip via layer_config: model.layers.{i}.self_attn.{q,k,v,o}_proj  (bits=16)
ar.save_quantized(out_dir, format="auto_gptq")

Post-processing applied: auto_map stripped from config, model_type reset to deepseek_vl_v2, modules_to_not_convert populated for attention, all bf16/f32 tensors cast to fp16, all g_idx tensors dropped, single safetensors file.

License

Inherits the original DeepSeek-OCR-2 license.

Acknowledgements

Original model: DeepSeek-AI
Quantization tool: Intel AutoRound
Inference engine: vLLM

Downloads last month: 332

Safetensors

Model size

3B params

Tensor type

I32

F16

Inference Providers NEW

Image-Text-to-Text

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for bartsolutions/DeepSeek-OCR-2-GPTQ-INT8

Base model

deepseek-ai/DeepSeek-OCR-2

Quantized

(4)

this model