DeepSeek-R1-Distill-Llama-8B-NVFP4

This is an NVFP4 quantized version of deepseek-ai/DeepSeek-R1-Distill-Llama-8B, optimized for NVIDIA GPUs using TensorRT-LLM.

Quantization Details

Property              Value
Base Model            deepseek-ai/DeepSeek-R1-Distill-Llama-8B
Quantization Method   NVFP4 (4-bit FP4 E2M1 weights with per-block FP8 scales)
Calibration Dataset   CNN/DailyMail
Calibration Samples   512
Tool                  NVIDIA TensorRT Model Optimizer v0.35.0
Export Format         Hugging Face checkpoint
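The block-scaled layout can be illustrated with a small numeric sketch. This is purely illustrative (plain Python, not the actual TensorRT-LLM kernels): each 16-element block of weights is mapped to 4-bit FP4 (E2M1) values sharing one scale; in the real format the scale itself is also stored in FP8 (E4M3).

```python
# Illustrative sketch of NVFP4-style block quantization (assumed layout:
# 16-element blocks of FP4 E2M1 values sharing one scale; the real
# TensorRT-LLM kernels additionally store the scale in FP8 E4M3).

# Magnitudes representable by FP4 E2M1 (2 exponent bits, 1 mantissa bit)
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID = sorted({s * v for v in E2M1 for s in (-1.0, 1.0)})

def quantize_block(block, block_size=16):
    """Map one block of floats to the FP4 grid with a shared scale."""
    assert len(block) == block_size
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0   # 6.0 = max E2M1 magnitude
    q = [min(FP4_GRID, key=lambda g: abs(x / scale - g)) for x in block]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

block = [0.07, -0.3, 0.55, 1.2, -0.9, 0.0, 0.42, -1.5,
         0.8, -0.05, 0.33, 1.05, -0.66, 0.21, -0.12, 0.6]
q, scale = quantize_block(block)
recon = dequantize_block(q, scale)
```

Here amax is 1.5, so the shared scale is 0.25 and every reconstructed value lands within 0.25 of the original.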

Hardware Requirements

  • GPU: NVIDIA GPU with hardware FP4 support (Blackwell architecture or newer; Ada Lovelace and Hopper support FP8 but not FP4)
  • VRAM: ~40GB recommended
  • Tested on: NVIDIA DGX Spark (GB10)
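A quick way to sanity-check FP4 eligibility is by CUDA compute capability. The mapping below is an assumption based on publicly documented architecture numbers (Blackwell reports compute capability 10.x or 12.x), not an official API:

```python
# Rough FP4-eligibility check by CUDA compute capability.
# Assumption: hardware FP4 (NVFP4) support begins with Blackwell,
# which reports compute capability 10.x (data center, e.g. GB10/GB200)
# or 12.x (consumer). Hopper (9.0) and Ada Lovelace (8.9) support FP8
# but not FP4.

def supports_fp4(major: int, minor: int) -> bool:
    return major >= 10

# On a machine with PyTorch installed, the capability can be read with:
# import torch
# major, minor = torch.cuda.get_device_capability()

print(supports_fp4(10, 0))  # Blackwell     -> True
print(supports_fp4(8, 9))   # Ada Lovelace  -> False
```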

Usage

With TensorRT-LLM

from tensorrt_llm import LLM

llm = LLM(model="YOUR_USERNAME/DeepSeek-R1-Distill-Llama-8B-NVFP4")
# generate() returns a RequestOutput; the generated text is in outputs[0].text
output = llm.generate("Paris is great because")
print(output.outputs[0].text)

With TensorRT-LLM Server

trtllm-serve YOUR_USERNAME/DeepSeek-R1-Distill-Llama-8B-NVFP4 \
  --backend pytorch \
  --port 8000
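trtllm-serve exposes an OpenAI-compatible HTTP API. A minimal client sketch, assuming the default port 8000 from the command above and a server reachable on localhost (endpoint path and response shape follow the OpenAI chat-completions convention):

```python
import json

# Request body for the OpenAI-compatible chat endpoint served by trtllm-serve.
payload = {
    "model": "YOUR_USERNAME/DeepSeek-R1-Distill-Llama-8B-NVFP4",
    "messages": [{"role": "user", "content": "Paris is great because"}],
    "max_tokens": 128,
}
body = json.dumps(payload).encode()

# Sending requires the server started above to be running:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```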

Limitations

  • Requires TensorRT-LLM for inference
  • Not compatible with standard transformers library
  • Optimized for NVIDIA GPUs only

License

This model inherits its license from the base model; see the DeepSeek license for details.

Acknowledgments
