---
license: mit
library_name: transformers
---
# DeepSeek-R1-GPTQ-4b-128g
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
<!-- markdownlint-disable no-duplicate-header -->

## Model Overview

This model was obtained by quantizing the weights of [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) to the INT4 data type. This optimization reduces the number of bits per parameter from 8 to 4, which cuts disk size and GPU memory requirements by approximately 50%.

All layers within transformer blocks are compressed. Weights are quantized using a symmetric per-group scheme, with group size 128. The GPTQ algorithm is applied for quantization.
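
To make the symmetric per-group scheme concrete, here is a minimal sketch in plain Python. It uses simple round-to-nearest, not GPTQ itself, which additionally adjusts the remaining weights to compensate for rounding error. The max-abs scale, the `[-8, 7]` integer range, and all function names are assumptions made for illustration.

```python
def quantize_group_symmetric(weights, bits=4):
    """Round-to-nearest symmetric quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (bits - 1))    # -8 for INT4
    max_abs = max(abs(w) for w in weights)
    # Symmetric: a single scale per group and no zero-point.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def quantize_per_group(weights, group_size=128, bits=4):
    """Split a flat list of weights into groups and quantize each one."""
    return [
        quantize_group_symmetric(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize_per_group(groups):
    """Map integer codes back to floats using each group's scale."""
    out = []
    for q, scale in groups:
        out.extend(v * scale for v in q)
    return out


# Toy example: 256 weights form two groups of 128.
weights = [(i % 17 - 8) / 10 for i in range(256)]
groups = quantize_per_group(weights)
restored = dequantize_per_group(groups)
```

Each group stores 128 four-bit codes plus one scale, so the reconstruction error of any weight in this sketch is at most half of its group's scale.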

The model checkpoint is saved in the [compressed_tensors](https://github.com/neuralmagic/compressed-tensors) format.

| Model | Experts quantized | Attention blocks quantized | Size (GB) |
| ------ | :---------: | :---------: | :---------: |
| [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) | ❌ | ❌ | 671 |
| [ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g](https://huggingface.co/ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g) | ✅ | ✅ | 325 |
| [cognitivecomputations/DeepSeek-R1-AWQ](https://huggingface.co/cognitivecomputations/DeepSeek-R1-AWQ) | ✅ | ✅ | 340 |

### Evaluation

This model was evaluated on the OpenLLM v1 benchmarks and reasoning tasks (AIME-24, GPQA-Diamond, MATH-500). 

Model outputs were generated with the vLLM engine.

For reasoning tasks, we estimate pass@1 from 10 runs with different seeds, using `temperature=0.6`, `top_p=0.95`, and `max_new_tokens=32768`.
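
As a rough illustration of that estimate, the sketch below averages per-run accuracy over several seeded runs. The data layout (`run_results[r][p]`) and the function name are hypothetical; the actual evaluation harness may aggregate differently.

```python
from statistics import mean


def pass_at_1(run_results):
    """Estimate pass@1 (in percent) from repeated single-sample runs.

    run_results[r][p] is True if run r (one seed) solved problem p.
    """
    per_run_accuracy = [
        mean(1.0 if solved else 0.0 for solved in run) for run in run_results
    ]
    # Averaging over seeds reduces the variance of a single sampled run.
    return 100.0 * mean(per_run_accuracy)


# Two runs over two problems: accuracies 0.5 and 1.0.
example = pass_at_1([[True, False], [True, True]])
print(example)  # 75.0
```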

#### OpenLLM Leaderboard V1 tasks 

|                                              | Recovery (%) | Average Score | ARC-Challenge<br>acc_norm, 25-shot | GSM8k<br>exact_match, 5-shot | HellaSwag<br>acc_norm, 10-shot | MMLU<br>acc, 5-shot | TruthfulQA<br>mc2, 0-shot | WinoGrande<br>acc, 5-shot |
| ------------------------------------------ | :----------: | :-----------: | :--------------------------------: | :--------------------------: | :----------------------------: | :-----------------: | :-----------------------: | :-----------------------: |
| deepseek-ai/DeepSeek-R1                      | 100.00       | 81.04         | 72.53                              | 95.91                        | 89.30                          | 87.22               | 59.28                     | 82.00                     |
| cognitivecomputations/DeepSeek-R1-AWQ        | 100.07       | 81.10         | 73.12                              | 95.15                        | 89.07                          | 86.86               | 60.09                     | 82.32                     |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g <br> **(this model)**         | 99.86        | 80.93         | 72.70                              | 95.68                        | 89.25                          | 86.83               | 58.77                     | 82.32                     |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 100.30       | 81.28         | 72.53                              | 95.68                        | 89.36                          | 86.99               | 59.77                     | 83.35                     |

#### Reasoning tasks (AIME-24, GPQA-Diamond, MATH-500)

|                                              | Recovery (%) | Average Score | AIME 2024<br>pass@1 | MATH-500<br>pass@1 | GPQA Diamond<br>pass@1 |
| -------------------------------------------- | :----------: | :-----------: | :-----------------: | :----------------: | :--------------------: |
| deepseek-ai/DeepSeek-R1                      | 100.00       | 82.99         | 78.33               | 97.24              | 73.38                  |
| cognitivecomputations/DeepSeek-R1-AWQ        | 94.29        | 78.25         | 70.67               | 93.64              | 70.46                  |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g <br> **(this model)**        | 96.52        | 80.10         | 72.96               | 97.09              | 70.26                  |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 98.81        | 82.00         | 77.00               | 97.08              | 71.92                  |
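
The Recovery (%) column in both tables is consistent with the ratio of a quantized model's average score to the baseline's average score. The short sketch below reproduces two entries under that reading; the function name is illustrative only.

```python
def recovery(quantized_score, baseline_score):
    """Quantized average score as a percentage of the baseline average."""
    return 100.0 * quantized_score / baseline_score


# Average scores for this model and the baseline, taken from the tables above.
print(round(recovery(80.93, 81.04), 2))  # 99.86 (OpenLLM v1)
print(round(recovery(80.10, 82.99), 2))  # 96.52 (reasoning tasks)
```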

## Reproduction

The results were obtained using the following commands:

`OpenLLM v1`
```bash
MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-act_order-mse_scale
MODEL_ARGS="pretrained=$MODEL,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True"

lm_eval \
  --model vllm \
  --model_args $MODEL_ARGS \
  --tasks openllm \
  --batch_size auto
```

For reasoning evals we adopted the protocol from the [open-r1 repository](https://github.com/huggingface/open-r1).

`Reasoning tasks`
```bash
MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-act_order-mse_scale
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=38768,gpu_memory_utilization=0.8,tensor_parallel_size=8,add_special_tokens=false,generation_parameters={\"max_new_tokens\":32768,\"temperature\":0.6,\"top_p\":0.95,\"seed\":7686}"

export VLLM_WORKER_MULTIPROC_METHOD=spawn
lighteval vllm $MODEL_ARGS "custom|aime24|0|0,custom|math_500|0|0,custom|gpqa:diamond|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir $OUTPUT_DIR
```

Please use this version of vLLM: https://github.com/vllm-project/vllm/pull/16038

## Performance benchmarking
We follow the standard vLLM performance benchmarking setup with the ShareGPT dataset and observe the following metrics (lower is better):

|                                              | Time to First Token<br>Median TTFT (ms) ↓ | Time per Output Token<br>Median TPOT (ms) ↓ | Inter-token Latency<br>Median ITL (ms) ↓ |
| -------------------------------------------- | :-------------------------------------: | :---------------------------------------: | :------------------------------------: |
| cognitivecomputations/DeepSeek-R1-AWQ        | 1585.45                                 | 55.41                                     | 43.06                                  |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 1344.68                                 | 41.49                                     | 36.33                                  |
| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g <br> **(this model)**        | 815.19                                  | 44.65                                     | 37.88                                  |
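
For readers unfamiliar with the three latency metrics, here is a minimal sketch of how they can be derived from per-token arrival timestamps. The definitions follow common serving-benchmark conventions and are assumptions here; the function names and input layout are hypothetical and not taken from the benchmark script.

```python
from statistics import median


def request_metrics(start, token_times):
    """Latency metrics for one request, in seconds.

    start: time the request was sent.
    token_times: arrival time of each output token, in order.
    """
    ttft = token_times[0] - start  # time to first token
    # Inter-token latency: gap between each pair of consecutive tokens.
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    # TPOT: decode time averaged over all tokens after the first.
    n = len(token_times)
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else None
    return ttft, tpot, itl


def summarize(requests):
    """Median TTFT, TPOT, and ITL in milliseconds over (start, token_times) pairs."""
    ttfts, tpots, itls = [], [], []
    for start, token_times in requests:
        ttft, tpot, itl = request_metrics(start, token_times)
        ttfts.append(ttft)
        if tpot is not None:
            tpots.append(tpot)
        itls.extend(itl)
    return {
        "median_ttft_ms": 1000.0 * median(ttfts),
        "median_tpot_ms": 1000.0 * median(tpots),
        "median_itl_ms": 1000.0 * median(itls),
    }


summary = summarize([
    (0.0, [1.0, 1.5, 2.5]),
    (0.0, [2.0, 2.25, 2.5]),
])
```

Under these definitions TPOT is a per-request average while ITL pools every individual gap, which is one reason the two medians in the table can differ.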

The GPTQ models are faster than the AWQ model across all metrics because they use fewer effective bits per parameter. More specifically, to preserve accuracy AWQ has to use a smaller group size of 64 (vs. 128 for GPTQ), and its asymmetric quantization additionally requires storing zero-points.
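
The bits-per-parameter gap can be estimated with simple arithmetic. The sketch below assumes 16-bit scales and 4-bit zero-points; neither width is stated in this card, so treat the numbers as illustrative rather than as the exact storage layout of either checkpoint.

```python
def effective_bits(weight_bits, group_size, scale_bits=16, zero_point_bits=0):
    """Average bits stored per weight, including per-group metadata."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size


# Symmetric, group size 128: one scale per group, no zero-point.
gptq_bits = effective_bits(4, 128)
# Asymmetric, group size 64: one scale and one zero-point per group.
awq_bits = effective_bits(4, 64, zero_point_bits=4)

print(gptq_bits)  # 4.125
print(awq_bits)   # 4.3125
```

Under these assumptions the AWQ layout carries roughly 4.5% more bits per weight, which is in the same range as the 325 GB vs. 340 GB checkpoint sizes listed in the overview table.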

## Contributors
Denis Kuznedelev (Yandex), Eldar Kurtić (Red Hat AI & ISTA), Jiale Chen (ISTA), Michael Goin (Red Hat AI), Elias Frantar (ISTA), Dan Alistarh (Red Hat AI & ISTA).