DeepSeek-R1-Distill-Llama-8B-NVFP4

This is an NVFP4 quantized version of deepseek-ai/DeepSeek-R1-Distill-Llama-8B, optimized for NVIDIA GPUs using TensorRT-LLM.

Quantization Details

Property              Value
Base Model            deepseek-ai/DeepSeek-R1-Distill-Llama-8B
Quantization Method   NVFP4 (4-bit FP4 E2M1 weights with per-block FP8 scales)
Calibration Dataset   CNN/DailyMail
Calibration Samples   512
Tool                  NVIDIA TensorRT Model Optimizer v0.35.0
Export Format         Hugging Face checkpoint
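The block-scaled layout can be illustrated with a small numeric sketch. This is purely illustrative (plain Python, not the actual TensorRT-LLM kernels): each 16-element block of weights is mapped to 4-bit FP4 (E2M1) values sharing one scale; in the real format the scale itself is also stored in FP8 (E4M3).

```python
# Illustrative sketch of NVFP4-style block quantization (assumed layout:
# 16-element blocks of FP4 E2M1 values sharing one scale; the real
# TensorRT-LLM kernels additionally store the scale in FP8 E4M3).

# Magnitudes representable by FP4 E2M1 (2 exponent bits, 1 mantissa bit)
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID = sorted({s * v for v in E2M1 for s in (-1.0, 1.0)})

def quantize_block(block, block_size=16):
    """Map one block of floats to the FP4 grid with a shared scale."""
    assert len(block) == block_size
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0   # 6.0 = max E2M1 magnitude
    q = [min(FP4_GRID, key=lambda g: abs(x / scale - g)) for x in block]
    return q, scale

def dequantize_block(q, scale):
    return [v * scale for v in q]

block = [0.07, -0.3, 0.55, 1.2, -0.9, 0.0, 0.42, -1.5,
         0.8, -0.05, 0.33, 1.05, -0.66, 0.21, -0.12, 0.6]
q, scale = quantize_block(block)
recon = dequantize_block(q, scale)
```

Here amax is 1.5, so the shared scale is 0.25 and every reconstructed value lands within 0.25 of the original.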

Hardware Requirements

  • GPU: NVIDIA GPU with hardware FP4 support (Blackwell architecture or newer; Ada Lovelace and Hopper support FP8 but not FP4)
  • VRAM: ~40GB recommended
  • Tested on: NVIDIA DGX Spark (GB10)
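A quick way to sanity-check FP4 eligibility is by CUDA compute capability. The mapping below is an assumption based on publicly documented architecture numbers (Blackwell reports compute capability 10.x or 12.x), not an official API:

```python
# Rough FP4-eligibility check by CUDA compute capability.
# Assumption: hardware FP4 (NVFP4) support begins with Blackwell,
# which reports compute capability 10.x (data center, e.g. GB10/GB200)
# or 12.x (consumer). Hopper (9.0) and Ada Lovelace (8.9) support FP8
# but not FP4.

def supports_fp4(major: int, minor: int) -> bool:
    return major >= 10

# On a machine with PyTorch installed, the capability can be read with:
# import torch
# major, minor = torch.cuda.get_device_capability()

print(supports_fp4(10, 0))  # Blackwell     -> True
print(supports_fp4(8, 9))   # Ada Lovelace  -> False
```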

Usage

With TensorRT-LLM

from tensorrt_llm import LLM

llm = LLM(model="YOUR_USERNAME/DeepSeek-R1-Distill-Llama-8B-NVFP4")
# generate() returns a RequestOutput; the generated text is in outputs[0].text
output = llm.generate("Paris is great because")
print(output.outputs[0].text)

With TensorRT-LLM Server

trtllm-serve YOUR_USERNAME/DeepSeek-R1-Distill-Llama-8B-NVFP4 \
  --backend pytorch \
  --port 8000
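trtllm-serve exposes an OpenAI-compatible HTTP API. A minimal client sketch, assuming the default port 8000 from the command above and a server reachable on localhost (endpoint path and response shape follow the OpenAI chat-completions convention):

```python
import json

# Request body for the OpenAI-compatible chat endpoint served by trtllm-serve.
payload = {
    "model": "YOUR_USERNAME/DeepSeek-R1-Distill-Llama-8B-NVFP4",
    "messages": [{"role": "user", "content": "Paris is great because"}],
    "max_tokens": 128,
}
body = json.dumps(payload).encode()

# Sending requires the server started above to be running:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```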

Limitations

  • Requires TensorRT-LLM for inference
  • Not compatible with standard transformers library
  • Optimized for NVIDIA GPUs only

License

This model inherits its license from the base model; see the DeepSeek license for details.

Acknowledgments
