Instructions for using ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g with libraries and local apps.

Libraries

How to use ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
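
The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and the code assumes the server started above is listening on `http://localhost:8000` and returns the usual OpenAI-style chat completion response.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       model="ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g",
                       base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt and return the first choice's message text.

    Assumes a server is already running at the given base_url.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, once the server is up:
#   print(chat("What is the capital of France?"))
```

For the SGLang server described further down, the same helper should work with `base_url="http://localhost:30000"`.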

Use Docker

docker model run hf.co/ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g

SGLang

How to use ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g with Docker Model Runner:
```
docker model run hf.co/ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g
```

	---
	license: mit
	library_name: transformers
	---
	# DeepSeek-R1-GPTQ-4b-128g
	<!-- markdownlint-disable first-line-h1 -->
	<!-- markdownlint-disable html -->
	<!-- markdownlint-disable no-duplicate-header -->

	## Model Overview

	This model was obtained by quantizing the weights of [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) to the INT4 data type. This optimization reduces the number of bits per parameter from 8 to 4, which cuts disk size and GPU memory requirements by approximately 50%.

	All layers within transformer blocks are compressed. Weights are quantized using a symmetric per-group scheme, with group size 128. The GPTQ algorithm is applied for quantization.
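
To make the "symmetric per-group scheme" concrete, here is a minimal sketch of symmetric INT4 quantization with one scale per group of 128 weights. It uses plain round-to-nearest rather than GPTQ (GPTQ additionally adjusts the remaining weights to compensate for rounding error), and all function names are hypothetical; it only illustrates how the scales and integer codes relate to the original weights.

```python
import random


def quantize_group(values, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (bits - 1))    # -8 for INT4
    max_abs = max(abs(v) for v in values)
    # Symmetric scheme: one scale, no zero-point; zero maps exactly to code 0.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return codes, scale


def quantize_row(row, group_size=128, bits=4):
    """Split a weight row into groups and quantize each with its own scale."""
    assert len(row) % group_size == 0, "row length must be a multiple of group_size"
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [quantize_group(group, bits) for group in groups]


def dequantize_row(quantized):
    """Reconstruct approximate weights from (codes, scale) pairs."""
    return [code * scale for codes, scale in quantized for code in codes]


random.seed(0)
row = [random.gauss(0.0, 1.0) for _ in range(256)]  # toy weights, not model data
quantized = quantize_row(row)
restored = dequantize_row(quantized)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(quantized)}, max abs error: {max_error:.4f}")
```

Because each group has its own scale, an outlier weight only degrades the resolution of the 128 weights that share its group, not the whole row.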

	The model checkpoint is saved in the [compressed_tensors](https://github.com/neuralmagic/compressed-tensors) format.

	| Model | Experts quantized | Attention blocks quantized | Size (GB) |
	| ------ | --------- | --------- | --------- |
	| [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) | ❌ | ❌ | 671 |
	| [ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g](https://huggingface.co/ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g) | ✅ | ✅ | 325 |
	| [cognitivecomputations/DeepSeek-R1-AWQ](https://huggingface.co/cognitivecomputations/DeepSeek-R1-AWQ) | ✅ | ✅ | 340 |

	### Evaluation

	This model was evaluated on the OpenLLM v1 benchmarks and reasoning tasks (AIME-24, GPQA-Diamond, MATH-500).

	Model outputs were generated with the vLLM engine.

	For reasoning tasks, we estimate pass@1 from 10 runs with different seeds, using `temperature=0.6`, `top_p=0.95`, and `max_new_tokens=32768`.
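
As an illustration of how pass@1 can be estimated from repeated runs, the sketch below averages, over problems, the fraction of runs that solved each problem. The function name and the toy data are hypothetical, and the exact aggregation performed by the evaluation tooling may differ in detail.

```python
def estimate_pass_at_1(results):
    """Estimate pass@1 (in percent) from repeated runs.

    results[r][p] is True if run r (one seed) solved problem p.
    """
    num_runs = len(results)
    num_problems = len(results[0])
    assert all(len(run) == num_problems for run in results)
    # Per-problem success rate across seeds, then the mean over problems.
    per_problem = [
        sum(run[p] for run in results) / num_runs for p in range(num_problems)
    ]
    return 100.0 * sum(per_problem) / num_problems


# Toy example: 2 runs over 3 problems (not real evaluation data).
toy_results = [
    [True, False, True],
    [True, True, False],
]
toy_score = estimate_pass_at_1(toy_results)
print(f"pass@1 estimate: {toy_score:.2f}%")
```

With equal numbers of problems per run, this is the same as averaging the per-run accuracies.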

	#### OpenLLM Leaderboard V1 tasks

	| Model | Recovery (%) | Average Score | ARC-Challenge<br>acc_norm, 25-shot | GSM8k<br>exact_match, 5-shot | HellaSwag<br>acc_norm, 10-shot | MMLU<br>acc, 5-shot | TruthfulQA<br>mc2, 0-shot | WinoGrande<br>acc, 5-shot |
	| ------------------------------------------ | :----------: | :-----------: | :--------------------------------: | :--------------------------: | :----------------------------: | :-----------------: | :-----------------------: | :-----------------------: |
	| deepseek-ai/DeepSeek-R1 | 100.00 | 81.04 | 72.53 | 95.91 | 89.30 | 87.22 | 59.28 | 82.00 |
	| cognitivecomputations/DeepSeek-R1-AWQ | 100.07 | 81.10 | 73.12 | 95.15 | 89.07 | 86.86 | 60.09 | 82.32 |
	| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g <br> (this model) | 99.86 | 80.93 | 72.70 | 95.68 | 89.25 | 86.83 | 58.77 | 82.32 |
	| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 100.30 | 81.28 | 72.53 | 95.68 | 89.36 | 86.99 | 59.77 | 83.35 |
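
The Recovery column is consistent with dividing each model's average score by the average score of the unquantized baseline. The sketch below reproduces it from the Average Score values in the table; the `recovery` helper is hypothetical, and the definition is inferred from the numbers rather than stated in the source.

```python
def recovery(quantized_average, baseline_average):
    """Quantized average score as a percentage of the unquantized baseline."""
    return 100.0 * quantized_average / baseline_average


# Average Score column of the OpenLLM Leaderboard V1 table.
baseline_average = 81.04  # unquantized DeepSeek-R1
quantized_averages = {
    "cognitivecomputations/DeepSeek-R1-AWQ": 81.10,
    "ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g": 80.93,
    "ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts": 81.28,
}

recoveries = {
    name: recovery(average, baseline_average)
    for name, average in quantized_averages.items()
}
for name, value in recoveries.items():
    print(f"{name}: {value:.2f}%")
```

A recovery slightly above 100% means the quantized model scored marginally higher than the baseline on average, which is within the noise one would expect on these benchmarks.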

	#### Reasoning tasks (AIME-24, GPQA-Diamond, MATH-500)

	| Model | Recovery (%) | Average Score | AIME 2024<br>pass@1 | MATH-500<br>pass@1 | GPQA Diamond<br>pass@1 |
	| -------------------------------------------- | :----------: | :-----------: | :-----------------: | :----------------: | :--------------------: |
	| deepseek-ai/DeepSeek-R1 | 100.00 | 82.99 | 78.33 | 97.24 | 73.38 |
	| cognitivecomputations/DeepSeek-R1-AWQ | 94.29 | 78.25 | 70.67 | 93.64 | 70.46 |
	| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g <br> (this model) | 96.52 | 80.10 | 72.96 | 97.09 | 70.26 |
	| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 98.81 | 82.00 | 77.00 | 97.08 | 71.92 |

	## Reproduction

	The results were obtained using the following commands:

	`OpenLLM v1`
	```bash
	MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-act_order-mse_scale
	MODEL_ARGS="pretrained=$MODEL,dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=8,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True"

	lm_eval \
	--model vllm \
	--model_args "$MODEL_ARGS" \
	--tasks openllm \
	--batch_size auto
	```

	For reasoning evals we adopted the protocol from the [open-r1 repository](https://github.com/huggingface/open-r1).

	`Reasoning tasks`
	```bash
	MODEL=ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-act_order-mse_scale
	MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=38768,gpu_memory_utilization=0.8,tensor_parallel_size=8,add_special_tokens=false,generation_parameters={\"max_new_tokens\":32768,\"temperature\":0.6,\"top_p\":0.95,\"seed\":7686}"

	export VLLM_WORKER_MULTIPROC_METHOD=spawn
	lighteval vllm "$MODEL_ARGS" "custom|aime24|0|0,custom|math_500|0|0,custom|gpqa:diamond|0|0" \
	--custom-tasks src/open_r1/evaluate.py \
	--use-chat-template \
	--output-dir "$OUTPUT_DIR"
	```

	Please use this version of vLLM: https://github.com/vllm-project/vllm/pull/16038

	## Performance benchmarking
	We follow the standard vLLM performance benchmarking procedure with the ShareGPT dataset and observe the following metrics (lower is better):

	| Model | Time to First Token<br>Median TTFT (ms) ↓ | Time per Output Token<br>Median TPOT (ms) ↓ | Inter-token Latency<br>Median ITL (ms) ↓ |
	| -------------------------------------------- | :-------------------------------------: | :---------------------------------------: | :------------------------------------: |
	| cognitivecomputations/DeepSeek-R1-AWQ | 1585.45 | 55.41 | 43.06 |
	| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g-experts | 1344.68 | 41.49 | 36.33 |
	| ISTA-DASLab/DeepSeek-R1-GPTQ-4b-128g <br> (this model) | 815.19 | 44.65 | 37.88 |

	The GPTQ models are faster than the AWQ model on all three metrics because they use fewer bits per parameter. More specifically, the AWQ model has to use a smaller group size of 64 (vs. 128 for GPTQ) to preserve accuracy, and its asymmetric quantization additionally requires storing zero-points.
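
The bits-per-parameter gap can be estimated with a short calculation. The sketch below assumes 16-bit scales and 4-bit zero-points; these storage widths are assumptions for illustration, not values taken from either checkpoint, and the helper name is hypothetical.

```python
def effective_bits_per_weight(weight_bits, group_size, scale_bits=16,
                              zero_point_bits=0):
    """Average storage per weight, including per-group metadata.

    scale_bits and zero_point_bits are assumed widths, not measured values.
    """
    metadata_bits = scale_bits + zero_point_bits
    return weight_bits + metadata_bits / group_size


# Symmetric GPTQ, group size 128: one scale per group, no zero-point.
gptq_bits = effective_bits_per_weight(4, group_size=128)

# Asymmetric AWQ, group size 64: one scale and one zero-point per group.
awq_bits = effective_bits_per_weight(4, group_size=64, zero_point_bits=4)

print(f"GPTQ: {gptq_bits} bits/weight, AWQ: {awq_bits} bits/weight")
```

Under these assumptions the AWQ layout stores about 4.5% more bits per quantized weight than the GPTQ layout, which is roughly in line with the 340 GB vs. 325 GB checkpoint sizes reported above.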

	## Contributors
	Denis Kuznedelev (Yandex), Eldar Kurtić (Red Hat AI & ISTA), Jiale Chen (ISTA), Michael Goin (Red Hat AI), Elias Frantar (ISTA), Dan Alistarh (Red Hat AI & ISTA).