Starting the model with vLLM failed.
I followed the vLLM installation method provided in the model card:
# install vllm
pip install vllm==0.11.2
# install deep_gemm
git clone https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM/third-party
git clone https://github.com/NVIDIA/cutlass.git
git clone https://github.com/fmtlib/fmt.git
cd ../
git checkout v2.1.1.post3
pip install . --no-build-isolation
An error occurred when starting vllm:
ValueError: No valid attention backend found for cuda with head_size: 576, dtype: torch.bfloat16, kv_cache_dtype: auto, block_size: 64, use_mla: True, has_sink: False, use_sparse: True. Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
With vLLM v0.13.0, the following error occurred during startup:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
The startup command for vllm is as follows:
export VLLM_USE_DEEP_GEMM=0 # ATM, this line is a "must" for Hopper devices
export TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4
# --enable-expert-parallel is optional.
# --speculative-config is optional; a throughput increase of roughly 50% was observed with it.
# (These notes sit above the command because a comment after a trailing backslash breaks the line continuation.)
vllm serve \
__YOUR_PATH__/QuantTrio/DeepSeek-V3.2-AWQ \
--served-model-name MY_MODEL_NAME \
--enable-auto-tool-choice \
--tool-call-parser deepseek_v31 \
--reasoning-parser deepseek_v3 \
--swap-space 16 \
--max-num-seqs 32 \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--speculative-config '{"model": "__YOUR_PATH__/QuantTrio/DeepSeek-V3.2-AWQ", "num_speculative_tokens": 1}' \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
Please help me. How can I properly start it? Thank you.
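The ValueError above packs every rejected attention backend into one line, which makes it hard to see which reason matters. Here is a minimal sketch that splits the "Reasons: {...}" part into one entry per backend. The function name `parse_backend_reasons` is hypothetical, the sample string is a shortened copy of the error quoted above, and the parsing assumes the message keeps the `NAME: [reason, reason]` layout shown in this thread.

```python
import re


def parse_backend_reasons(message: str) -> dict[str, list[str]]:
    """Map each backend name to the list of reasons vLLM gave for rejecting it."""
    # Everything after "Reasons:" holds the per-backend lists; empty if absent.
    _, _, body = message.partition("Reasons:")
    reasons: dict[str, list[str]] = {}
    for name, items in re.findall(r"(\w+): \[([^\]]*)\]", body):
        reasons[name] = [item.strip() for item in items.split(",")]
    return reasons


# Shortened copy of the error from this thread (only two backends kept).
sample = (
    "ValueError: No valid attention backend found for cuda. "
    "Reasons: {TRITON_MLA: [sparse not supported], "
    "FLASHMLA_SPARSE: [compute capability not supported]}."
)

for backend, why in parse_backend_reasons(sample).items():
    print(f"{backend}: {'; '.join(why)}")
```

In the errors posted here, FLASHMLA_SPARSE is the only backend that is not rejected with "sparse not supported", so its "compute capability not supported" entry appears to be the one that decides the outcome.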
What GPU device are you using?
This model uses a sparse‑attention mechanism, and for now it can only run on H‑series and B‑series cards. Older GPUs are not yet supported.
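To answer the GPU question quickly, the compute capability of each visible device can be printed with PyTorch. The sketch below is only a rough check: `torch.cuda.get_device_capability` is a real PyTorch call, but the set `ASSUMED_SUPPORTED_MAJORS` is an assumption inferred from this thread (major 9 for Hopper, major 10 for data-center Blackwell), not something taken from vLLM's source.

```python
# Assumption: the sparse MLA backend accepts only these compute capability majors
# (9 = Hopper, 10 = data-center Blackwell). Inferred from this thread, not from vLLM.
ASSUMED_SUPPORTED_MAJORS = {9, 10}


def describe_capability(major: int, minor: int) -> str:
    """Return a short label such as 'sm_80: likely unsupported'."""
    status = "likely supported" if major in ASSUMED_SUPPORTED_MAJORS else "likely unsupported"
    return f"sm_{major}{minor}: {status}"


def report_local_gpus() -> None:
    """Print the capability of every CUDA device PyTorch can see."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed.")
        return
    if not torch.cuda.is_available():
        print("No CUDA device visible to PyTorch.")
        return
    for index in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(index)
        print(torch.cuda.get_device_name(index), describe_capability(major, minor))


if __name__ == "__main__":
    report_local_gpus()
```

For reference, A100 reports compute capability 8.0, H100 reports 9.0, and RTX Blackwell cards report 12.0 (SM120).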
I have the same problem:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
Using 8x RTX Blackwell 6000
Parameters:
VLLM_USE_DEEP_GEMM=1 \
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 \
VLLM_USE_FLASHINFER_MOE_FP16=1 \
VLLM_USE_FLASHINFER_SAMPLER=0 \
OMP_NUM_THREADS=4 \
vllm serve QuantTrio/DeepSeek-V3.2-AWQ \
--host 192.168.xxx.yyy \
--port 8000 \
--enable-auto-tool-choice \
--tool-call-parser deepseek_v31 \
--reasoning-parser deepseek_v3 \
--swap-space 16 \
--max-num-seqs 32 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--served-model-name "vllm_thinkingparam" \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--speculative-config '{"model": "QuantTrio/DeepSeek-V3.2-AWQ", "num_speculative_tokens": 1}' \
--max_model_len $token
Have you all tried the installation steps from the official vLLM guide for DeepSeek-V3.2?
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
uv pip install git+https://github.com/deepseek-ai/DeepGEMM.git@v2.1.1.post3 --no-build-isolation # Other versions may also work. We recommend using the latest released version from https://github.com/deepseek-ai/DeepGEMM/releases
Yeah, I tried this, and it ends in the same problem:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
Are SM120 (RTX Blackwell) cards supported? To me it seems they aren't.
Could you try editing the config.json file and changing "torch_dtype": "bfloat16" to "torch_dtype": "float16"?
Then try one more time. If this still doesn't work, then it probably indeed doesn't work 🥲
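For anyone who prefers to script that edit, here is a minimal sketch of the config.json change. The helper name `set_torch_dtype` and the `.bak` backup file are hypothetical conveniences, not part of the model repository or of vLLM; the path in the usage comment reuses the `__YOUR_PATH__` placeholder from the model card.

```python
import json
from pathlib import Path
from typing import Optional


def set_torch_dtype(config_path: str, new_dtype: str = "float16") -> Optional[str]:
    """Rewrite "torch_dtype" in config.json and return the previous value."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    old_dtype = config.get("torch_dtype")
    # Save the original content next to the file so the change can be reverted.
    backup = path.with_name(path.name + ".bak")
    backup.write_text(json.dumps(config, indent=2), encoding="utf-8")
    config["torch_dtype"] = new_dtype
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return old_dtype


# Usage (path is a placeholder):
# set_torch_dtype("__YOUR_PATH__/QuantTrio/DeepSeek-V3.2-AWQ/config.json")
```

Restoring the original is a matter of copying config.json.bak back over config.json.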
Yeah, I tried it, but unfortunately a very similar error message occurs:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.float16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [dtype not supported, compute capability not supported]}.
:(
🥲
What GPU device are you using?
This model uses a sparse‑attention mechanism, and for now it can only run on H‑series and B‑series cards. Older GPUs are not yet supported.
I'm using 8*A100.
As above, I tested with 8×A100 and encountered the same issue. We need to wait for vLLM to support Sparse Attention on the Ampere architecture.
Same issue with 4× RTX PRO 6000.