DeepSeek-R1-Distill-Llama-70B-NVFP4
NVFP4 quantized version of DeepSeek-R1-Distill-Llama-70B using custom Blackwell NVFP4 GEMM kernels.
140 GB → 40 GB (0.29x) with vision tower excluded.
NVFP4 Quantization Details
| Property | Value |
|---|---|
| Base model | deepseek-ai/DeepSeek-R1-Distill-Llama-70B |
| Quantization | NVFP4 (W4A4 — weights FP4 E2M1, activations FP4, scales FP8 E4M3) |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor v0.10.0.2 |
| Calibration | 128 samples, ultrachat-200k (train_sft split), max_seq_length 2048 |
| Size | 40 GB (single safetensors shard set) |
| Requires | NVIDIA Blackwell GPU (SM 120), vLLM >= 0.19 |
Recipe
QuantizationModifier:
targets: [Linear]
ignore: [lm_head]
scheme: NVFP4
Usage
vLLM
vllm serve PiehSoft/DeepSeek-R1-Distill-Llama-70B-NVFP4 \
--host 0.0.0.0 \
--port 8081 \
--max-model-len 8192
Python
from vllm import LLM, SamplingParams
llm = LLM(model="PiehSoft/DeepSeek-R1-Distill-Llama-70B-NVFP4")
output = llm.generate("What is the meaning of life?", SamplingParams(max_tokens=256))
print(output[0].outputs[0].text)
Benchmarks
Tested on RTX PRO 6000 Blackwell 96GB:
| Backend | Generation tok/s | Prompt tok/s |
|---|---|---|
| vLLM 0.19.0 | 25.0 | 176.3 |
| llama.cpp (GGUF variant) | 33.6 | 196.5 |
GGUF Version
A GGUF version of this model is available at PiehSoft/DeepSeek-R1-Distill-Llama-70B-NVFP4-GGUF for use with llama.cpp.
Credits
Quantized by PiehSoft (William Pieh) on NVIDIA RTX PRO 6000 Blackwell 96GB.
- Downloads last month
- 19
Model tree for PiehSoft/DeepSeek-R1-Distill-Llama-70B-NVFP4
Base model
deepseek-ai/DeepSeek-R1-Distill-Llama-70B