# DeepSeek-R1-Distill-Llama-8B-NVFP4
This is an NVFP4 quantized version of deepseek-ai/DeepSeek-R1-Distill-Llama-8B, optimized for NVIDIA GPUs using TensorRT-LLM.
## Quantization Details
| Property | Value |
|---|---|
| Base Model | deepseek-ai/DeepSeek-R1-Distill-Llama-8B |
| Quantization Method | NVFP4 (4-bit FP4 weights with per-block FP8 scales) |
| Calibration Dataset | CNN/DailyMail |
| Calibration Samples | 512 |
| Tool | NVIDIA TensorRT Model Optimizer v0.35.0 |
| Export Format | Hugging Face |
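NVFP4 stores weights as 4-bit FP4 (E2M1) values, with one shared scale per 16-element block. The round trip can be sketched in pure Python; this is an illustration only, not the ModelOpt implementation, and the block scale here is kept as a plain float rather than FP8:

```python
# Representable FP4 (E2M1) magnitudes; each weight is one of these, signed.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one 16-element block: pick a scale so the largest
    magnitude maps to 6.0, then round each value to the nearest
    representable FP4 magnitude (sign kept separately)."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    q = []
    for x in block:
        mag = min(E2M1, key=lambda v: abs(abs(x) / scale - v))
        q.append(mag if x >= 0 else -mag)
    return q, scale

def dequantize_block(q, scale):
    """Recover approximate weights from FP4 values and the block scale."""
    return [v * scale for v in q]

block = [0.1, -0.3, 0.02, 0.6] + [0.0] * 12  # toy 16-element block
q, scale = quantize_block(block)
deq = dequantize_block(q, scale)
```

Values near zero collapse to 0.0 while the block maximum is preserved exactly, which is why per-block (rather than per-tensor) scaling matters for accuracy.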
## Hardware Requirements
- GPU: NVIDIA GPU with native FP4 support (Blackwell architecture or newer)
- VRAM: ~40GB recommended
- Tested on: NVIDIA DGX Spark (GB10)
## Usage

### With TensorRT-LLM
```python
from tensorrt_llm import LLM

llm = LLM(model="YOUR_USERNAME/DeepSeek-R1-Distill-Llama-8B-NVFP4")
output = llm.generate("Paris is great because")
print(output.outputs[0].text)
```
### With TensorRT-LLM Server

```bash
trtllm-serve YOUR_USERNAME/DeepSeek-R1-Distill-Llama-8B-NVFP4 \
  --backend pytorch \
  --port 8000
```
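Once running, the server exposes an OpenAI-compatible HTTP API. A minimal client sketch using only the standard library; the URL and model name are assumptions to adjust for your deployment:

```python
import json
import urllib.request

def build_completion_request(prompt, model, max_tokens=64):
    """Build the JSON payload for the OpenAI-style /v1/completions endpoint."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def query_server(payload, url="http://localhost:8000/v1/completions"):
    """POST the payload to a running trtllm-serve instance and parse the reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_completion_request(
    "Paris is great because", "DeepSeek-R1-Distill-Llama-8B-NVFP4"
)
# query_server(payload)  # requires a running trtllm-serve instance
```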
## Limitations
- Requires TensorRT-LLM for inference
- Not compatible with the standard `transformers` library, despite the Hugging Face export format
- Optimized for NVIDIA GPUs only
## License
This model inherits the license of the base model; see the DeepSeek license for details.