The model startup using vllm failed.

by beausoft - opened Dec 31, 2025

Discussion

beausoft

Dec 31, 2025

I followed the vLLM installation method provided in the documentation:

# install vllm
pip install vllm==0.11.2
# install deep_gemm
git clone https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM/third-party
git clone https://github.com/NVIDIA/cutlass.git
git clone https://github.com/fmtlib/fmt.git
cd ../
git checkout v2.1.1.post3
pip install . --no-build-isolation

An error occurred when starting vLLM:

ValueError: No valid attention backend found for cuda with head_size: 576, dtype: torch.bfloat16, kv_cache_dtype: auto, block_size: 64, use_mla: True, has_sink: False, use_sparse: True. Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.

When using vLLM v0.13.0, the following error occurred during startup:

ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
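The error above lists one rejection reason set per attention backend, which is hard to read as a single line. A minimal sketch of how the `Reasons: {...}` part could be split up is shown here. The function names and the shortened `SAMPLE` string are illustrative only and are not part of vLLM; the parsing assumes the message keeps the `NAME: [reason, reason]` layout seen in this thread.

```python
import re

# Shortened copy of the message from this thread, used only for illustration.
SAMPLE = (
    "ValueError: No valid attention backend found for cuda. Reasons: "
    "{FLASH_ATTN_MLA: [sparse not supported, compute capability not supported], "
    "TRITON_MLA: [sparse not supported], "
    "FLASHMLA_SPARSE: [compute capability not supported]}."
)


def parse_backend_reasons(message: str) -> dict:
    """Map each backend named after 'Reasons:' to its list of rejection reasons."""
    _, found, tail = message.partition("Reasons:")
    if not found:
        return {}
    reasons = {}
    # Each entry is assumed to look like NAME: [reason one, reason two]
    for name, body in re.findall(r"(\w+): \[([^\]]*)\]", tail):
        reasons[name] = [part.strip() for part in body.split(",") if part.strip()]
    return reasons


def sparse_capable_backends(reasons: dict) -> dict:
    """Keep only backends that were NOT rejected for lacking sparse support."""
    return {
        name: why
        for name, why in reasons.items()
        if "sparse not supported" not in why
    }


if __name__ == "__main__":
    for name, why in sparse_capable_backends(parse_backend_reasons(SAMPLE)).items():
        print(name, "->", "; ".join(why))
```

Read this way, the only backend that accepts sparse attention in these messages is `FLASHMLA_SPARSE`, and it is rejected for compute capability, which points at the GPU rather than at the installation.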

The startup command for vLLM is as follows:

export VLLM_USE_DEEP_GEMM=0  # ATM, this line is a "must" for Hopper devices
export TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4

# --enable-expert-parallel is optional.
# --speculative-config is optional; a throughput increase of roughly 50% is observed.
vllm serve \
    __YOUR_PATH__/QuantTrio/DeepSeek-V3.2-AWQ \
    --served-model-name MY_MODEL_NAME \
    --enable-auto-tool-choice \
    --tool-call-parser deepseek_v31 \
    --reasoning-parser deepseek_v3 \
    --swap-space 16 \
    --max-num-seqs 32 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --speculative-config '{"model": "__YOUR_PATH__/QuantTrio/DeepSeek-V3.2-AWQ", "num_speculative_tokens": 1}' \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000

Please help me. How can I properly start it? Thank you.

JunHowie

QuantTrio org Dec 31, 2025

edited Dec 31, 2025

What GPU device are you using?
This model uses a sparse‑attention mechanism, and for now it can only run on H‑series and B‑series cards. Older GPUs are not yet supported.
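To answer the GPU question before launching the server, the compute capability of each visible device can be checked with PyTorch. The sketch below is only a rough pre-check: the set of accepted major versions (9 for Hopper, 10 for data-center Blackwell) is an assumption inferred from the error text and the reports in this thread, not taken from vLLM's source, and the helper names are hypothetical.

```python
# Assumption (not taken from vLLM's source): the sparse MLA kernels target only
# compute capability major versions 9 (Hopper) and 10 (data-center Blackwell).
ASSUMED_SPARSE_MLA_MAJORS = frozenset({9, 10})


def sparse_mla_likely_supported(capability):
    """Return True if a (major, minor) compute capability is in the assumed set."""
    major, _minor = capability
    return major in ASSUMED_SPARSE_MLA_MAJORS


def local_capabilities():
    """List (major, minor) per visible CUDA device, or [] without PyTorch or CUDA."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_capability(index)
        for index in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    for index, cap in enumerate(local_capabilities()):
        verdict = "likely supported" if sparse_mla_likely_supported(cap) else "likely unsupported"
        print(f"GPU {index}: sm_{cap[0]}{cap[1]} -> {verdict}")
```

An A100 reports capability (8, 0) and SM120 RTX Blackwell cards report (12, 0), so both fall outside the assumed set, which matches the failures reported in this thread.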

mullerse

Jan 3

edited Jan 3

I have the same problem:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.

Using 8x RTX Blackwell 6000

Parameters:
VLLM_USE_DEEP_GEMM=1 \
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 \
VLLM_USE_FLASHINFER_MOE_FP16=1 \
VLLM_USE_FLASHINFER_SAMPLER=0 \
OMP_NUM_THREADS=4 \
vllm serve QuantTrio/DeepSeek-V3.2-AWQ \
    --host 192.168.xxx.yyy \
    --port 8000 \
    --enable-auto-tool-choice \
    --tool-call-parser deepseek_v31 \
    --reasoning-parser deepseek_v3 \
    --swap-space 16 \
    --max-num-seqs 32 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --served-model-name "vllm_thinkingparam" \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --speculative-config '{"model": "QuantTrio/DeepSeek-V3.2-AWQ", "num_speculative_tokens": 1}' \
    --max_model_len $token

tclf90

QuantTrio org Jan 4

Have you all tried the installation steps from the official vLLM guide for DeepSeek-V3.2?

source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
uv pip install git+https://github.com/deepseek-ai/DeepGEMM.git@v2.1.1.post3 --no-build-isolation # Other versions may also work. We recommend using the latest released version from https://github.com/deepseek-ai/DeepGEMM/releases

mullerse

Jan 4

Yes, I tried this, and it ends in the same problem:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.

Is SM120 (RTX Blackwell) supported? It seems to me that it isn't.

tclf90

QuantTrio org Jan 4

Could you try editing the config.json file, changing "torch_dtype": "bfloat16" to "torch_dtype": "float16"?
Then try one more time. If this still doesn't work, then it probably really isn't supported 🥲
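For anyone who wants to try the suggested change without hand-editing the file, a minimal sketch follows. It assumes a locally downloaded copy of the model whose `config.json` has a top-level `torch_dtype` key; the function name and the `.bak` backup convention are illustrative choices, not something the model card prescribes.

```python
import json
from pathlib import Path


def set_torch_dtype(config_path, new_dtype="float16"):
    """Rewrite the top-level torch_dtype in config.json and return the old value.

    A one-time backup named config.json.bak is kept next to the file so the
    original setting can be restored.
    """
    path = Path(config_path)
    original_text = path.read_text(encoding="utf-8")
    config = json.loads(original_text)

    backup = path.with_name(path.name + ".bak")
    if not backup.exists():
        backup.write_text(original_text, encoding="utf-8")

    old_dtype = config.get("torch_dtype")
    config["torch_dtype"] = new_dtype
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return old_dtype


if __name__ == "__main__":
    # Hypothetical local path; replace with the real model directory.
    print(set_torch_dtype("__YOUR_PATH__/QuantTrio/DeepSeek-V3.2-AWQ/config.json"))
```

Restoring the original setting is a matter of copying `config.json.bak` back over `config.json`.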

mullerse

Jan 4

edited Jan 4

Yes, I tried it, but unfortunately a very similar error message occurs:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.float16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [dtype not supported, compute capability not supported]}.

tclf90

QuantTrio org Jan 5

:(

🥲

beausoft

Jan 5

I'm using 8*A100.

JunHowie

QuantTrio org Jan 5

As above, I tested with 8×A100 and encountered the same issue. We need to wait for vLLM to support Sparse Attention on the Ampere architecture.

fanhed

Jan 9

Same issue with 4x RTX PRO 6000.
