Instructions for using Kaleto/DeepSeek-R1-Distill-Llama-70B-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use Kaleto/DeepSeek-R1-Distill-Llama-70B-NVFP4 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Kaleto/DeepSeek-R1-Distill-Llama-70B-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```
```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Kaleto/DeepSeek-R1-Distill-Llama-70B-NVFP4")
model = AutoModelForCausalLM.from_pretrained("Kaleto/DeepSeek-R1-Distill-Llama-70B-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Kaleto/DeepSeek-R1-Distill-Llama-70B-NVFP4 with vLLM:
Install from pip and serve the model:
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Kaleto/DeepSeek-R1-Distill-Llama-70B-NVFP4"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Kaleto/DeepSeek-R1-Distill-Llama-70B-NVFP4",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker:
```shell
docker model run hf.co/Kaleto/DeepSeek-R1-Distill-Llama-70B-NVFP4
```
- SGLang
How to use Kaleto/DeepSeek-R1-Distill-Llama-70B-NVFP4 with SGLang:
Install from pip and serve the model:
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Kaleto/DeepSeek-R1-Distill-Llama-70B-NVFP4" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Kaleto/DeepSeek-R1-Distill-Llama-70B-NVFP4",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker images:
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "Kaleto/DeepSeek-R1-Distill-Llama-70B-NVFP4" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Kaleto/DeepSeek-R1-Distill-Llama-70B-NVFP4",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
- Docker Model Runner
How to use Kaleto/DeepSeek-R1-Distill-Llama-70B-NVFP4 with Docker Model Runner:
```shell
docker model run hf.co/Kaleto/DeepSeek-R1-Distill-Llama-70B-NVFP4
```
DeepSeek-R1-Distill-Llama-70B — NVFP4 (compressed-tensors)
Built with Llama.
NVFP4 (4-bit floating-point, W4A4, group_size=16) quantization of deepseek-ai/DeepSeek-R1-Distill-Llama-70B, produced via a distributed 2-node pipeline on NVIDIA DGX Spark (GB10) hardware.
To my knowledge this is the first publicly available NVFP4 quantization of the DeepSeek-R1-Distill-Llama-70B base — the top non-RP reasoning model in the 70B class, with ~4.5 M downloads on the original.
Quick facts
| | |
|---|---|
| Base model | deepseek-ai/DeepSeek-R1-Distill-Llama-70B (Llama-3.3-70B distilled from R1) |
| Architecture | LlamaForCausalLM, 80 layers, hidden_size=8192, 64 attn heads, 8 KV heads, head_dim=128 |
| Original size | ~132 GB (BF16) |
| Quantized size | ~40 GB (see Files tab) |
| Quant format | NVFP4 via nvidia-modelopt 0.43.0 |
| Storage layout | compressed-tensors (vLLM-native) |
| lm_head | Kept BF16 (unquantized), in quantization_config.ignore |
| KV cache | Configurable at serve time (FP8 recommended) |
| Calibration data | 256 samples from cnn_dailymail, lengths 150–1200 tokens |
| Conversion date | 2026-05-15 |
Why this exists
DeepSeek-R1-Distill-Llama-70B is the most-downloaded non-RP reasoning model in the 70B class (~4.5 M downloads on the original), and until now it had no public NVFP4 quantization despite being an ideal target: standard Llama-3.3 architecture, and at 70B it fits cleanly on a single 128 GB UMA DGX Spark in NVFP4 with massive KV-cache headroom for long reasoning chains.
This release closes that gap with a production-quality 256-sample calibration run on a 2-Spark Ray cluster, using the same pipeline that produced Anubis-Pro-105B-NVFP4 and Behemoth-X-123B-v2.2-NVFP4 — open at github.com/KaletoAI/distrib-nvfp4 (Apache 2.0).
For 70B-class models the distributed pipeline is honestly overkill (the model fits on one Spark for quantization too), but it is the same toolchain, so reusing it is free. The benefit: identical workflow, identical fix-list, and identical reproducibility to the larger releases.
Quantization Pipeline (short version)
Two Ray actors own 40 layers each. modelopt's `mtq.quantize(wrapper, NVFP4_DEFAULT_CFG, forward_loop=None)` inserts the W4A4 quantizers in calibration mode without running its own forward pass; the driver routes hidden states between actors via Ray RPC for each of the 256 calibration samples.
After finalize, each actor evicts its shard to disk (cloudpickle, since modelopt's QuantLinear classes are generated dynamically), then streams a per-layer NVFP4 export via `mte.export_hf_checkpoint` on a 1-layer template (with `use_cache=False`). The driver merges the per-actor shards, renames the layer indices on shard 1 with the +40 offset, copies the tokenizer (DeepSeek uses a tiktoken BPE — no `tokenizer.model` file), patches `config.json` to keep `lm_head` in BF16, and injects `input_scale=1.0` for every weight quantizer (modelopt 0.43 omits these, but vLLM's loader requires them).
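For orientation, a single-node equivalent of the modelopt calls named above looks roughly like the sketch below. The calibration texts and export path are placeholders, and the released artifact was produced by the distributed 2-actor Ray pipeline, not this script.

```python
# Minimal single-node sketch of the modelopt NVFP4 flow described above.
# Assumes modelopt >= 0.43; calibration texts and export path are placeholders,
# and the released checkpoint was built by the distributed Ray pipeline instead.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def forward_loop(m):
    # Replace with a real calibration set (the release used 256 cnn_dailymail
    # samples of 150-1200 tokens); a single toy sample is shown here.
    for text in ["The quick brown fox jumps over the lazy dog."]:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

# Insert the W4A4 NVFP4 quantizers and calibrate them in one pass.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop)

# Export an HF-style checkpoint in a layout vLLM can load.
export_hf_checkpoint(model, export_dir="DeepSeek-R1-Distill-Llama-70B-NVFP4")
```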
Calibration health on the run that produced this artifact:
- shard0 (layers 0–39 + embed): good=280, zero=0, nan=0
- shard1 (layers 40–79 + norm + lm_head): good=280, zero=0, nan=0
(NVFP4_DEFAULT_CFG inserts 7 quantizers per layer for Llama arch.)
Total pipeline time: 25 min on 2× DGX Spark (IB-connected at 10.20.0.x). Load 3 min, calibrate ~15 min, eviction 105 s, export 110 s, merge 25 s.
Performance
A stock-vLLM benchmark will follow as a separate update; the pattern is expected to be consistent with the related Anubis-Pro and Behemoth releases:
Anubis-Pro-105B-NVFP4 (for reference):
- Stock vLLM: ~3.1 tok/s decode short context
- MARLIN+FlashInfer: 3.78 tok/s (+22 %)
DeepSeek-R1-Distill-Llama-70B-NVFP4 (this model):
- Expected to be faster than both Anubis (105B) and Behemoth (123B) due to smaller size
- Estimate ~4.5–5.5 tok/s decode on the MARLIN+FlashInfer stack
- Will measure and update once the model is benched on Spark
For reasoning workloads (long chain-of-thought outputs) on a single Spark, this model is the sweet spot — 70B class, fits with ample KV-cache pool, and the NVFP4 quality preservation at W4A4 retains the R1-distilled reasoning behaviour.
Usage
vLLM (direct)
Recommended on GB10 — the tuned Spark stack with MARLIN GEMM + FlashInfer attention:
```shell
VLLM_NVFP4_GEMM_BACKEND=marlin \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
vllm serve /path/to/DeepSeek-R1-Distill-Llama-70B-NVFP4 \
  --served-model-name DeepSeek-R1-Distill-Llama-70B-NVFP4 \
  --attention-backend flashinfer \
  --quantization compressed-tensors \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.80 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --port 9007
```
`--gpu-memory-utilization 0.80` with the ~40 GB DeepSeek NVFP4 weights leaves a ~62 GB KV-cache pool on a 128 GB UMA Spark — enough for 32 K context at `--max-num-seqs 4` with a healthy chain-of-thought reasoning buffer. Bump it to 0.85 if you want more concurrency.
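As a rough sanity check on that headroom, the per-token FP8 KV-cache footprint follows from the architecture figures in the Quick facts table (80 layers, 8 KV heads, head_dim=128). The sketch below is an estimate under those assumptions, not a measurement; vLLM also reserves part of the budget for activations and CUDA graphs.

```python
# Back-of-the-envelope KV-cache headroom for the serve settings above (estimate only).
TOTAL_MEM_GB = 128        # DGX Spark unified memory
GPU_MEM_UTIL = 0.80       # --gpu-memory-utilization
WEIGHTS_GB = 40           # NVFP4 checkpoint size

layers, kv_heads, head_dim = 80, 8, 128
bytes_per_token = layers * 2 * kv_heads * head_dim * 1   # K + V, 1 byte each in FP8

kv_pool_gb = TOTAL_MEM_GB * GPU_MEM_UTIL - WEIGHTS_GB
tokens = kv_pool_gb * 1024**3 / bytes_per_token
print(f"KV pool ~{kv_pool_gb:.0f} GB -> ~{tokens / 1000:.0f}K tokens of FP8 KV cache")
# Roughly 62 GB and ~400K tokens, far more than 4 sequences x 32K context require.
```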
llama-swap entry
"DeepSeek-R1-Distill-Llama-70B-NVFP4":
proxy: "http://127.0.0.1:9007"
ttl: 0
checkEndpoint: "/health"
env:
- "VLLM_NVFP4_GEMM_BACKEND=marlin"
- "VLLM_TEST_FORCE_FP8_MARLIN=1"
- "VLLM_MARLIN_USE_ATOMIC_ADD=1"
cmd: >-
/home/<user>/vllm-env/bin/python3 -m vllm.entrypoints.openai.api_server
--model /home/<user>/models/DeepSeek-R1-Distill-Llama-70B-NVFP4
--attention-backend flashinfer
--served-model-name DeepSeek-R1-Distill-Llama-70B-NVFP4
--quantization compressed-tensors
--dtype auto
--kv-cache-dtype fp8
--max-model-len 32768
--max-num-seqs 4
--gpu-memory-utilization 0.80
--trust-remote-code
--enable-chunked-prefill
--enable-prefix-caching
--port 9007
--host 127.0.0.1
Recommended sampling (from DeepSeek's original card)
R1-distilled models perform best with:
- `temperature`: 0.6
- `top_p`: 0.95
- Avoid system prompts — the DeepSeek-R1 family expects user-first conversation flow
- For reasoning tasks: let the `<think>...</think>` block grow uncapped; set `--max-tokens` high (4096+)
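A minimal client-side example with these settings, assuming the vLLM server from the Usage section above is running on port 9007 with the served model name shown there:

```python
# Minimal OpenAI-compatible client call using the recommended sampling values.
# Assumes the vLLM server from the Usage section is running on port 9007.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:9007/v1", api_key="none")

resp = client.chat.completions.create(
    model="DeepSeek-R1-Distill-Llama-70B-NVFP4",
    messages=[
        # No system prompt: the R1 family expects user-first conversation flow.
        {"role": "user", "content": "How many primes are there between 100 and 120?"},
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=4096,  # leave room for the <think>...</think> block
)
print(resp.choices[0].message.content)
```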
Files in this repository
- `model-NNNNN-of-00008.safetensors` — 8 shards, NVFP4-packed weights + scales (~40 GB total)
- `model.safetensors.index.json` — weight map (~2,403 keys: 80 layers × 7 quant linears × 4 keys + norms + embed + lm_head + injected input_scale)
- `config.json` — Llama config with `quantization_config.ignore=["lm_head"]` and `input_activations.dynamic: true`
- `hf_quant_config.json`, `generation_config.json` — auxiliary configs
- `tokenizer.json`, `tokenizer_config.json` — DeepSeek tokenizer (tiktoken BPE; no `tokenizer.model` file)
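To sanity-check a downloaded copy, the weight map can be inspected directly; a small sketch (the local path is a placeholder):

```python
# Quick sanity check of a downloaded copy: key count, injected input_scale entries,
# and the unquantized lm_head. The local path is a placeholder.
import json

with open("DeepSeek-R1-Distill-Llama-70B-NVFP4/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

print(len(weight_map), "keys")                                        # expect ~2,403
print(sum(k.endswith(".input_scale") for k in weight_map), "input_scale keys")
print("lm_head kept in BF16:", "lm_head.weight" in weight_map)
```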
Recent fixes baked into the conversion
modelopt 0.43's NVFP4 export has six gotchas that have to be worked around before vLLM will serve the output without producing garbage. All of them are applied automatically by the pipeline:
- Phase-6 1-layer template needs `vocab_size=2` (not 1) because modelopt's `llm_dummy_forward` feeds `torch.ones([1, 2])`.
- Phase-6 template needs `pad_token_id=None` and `bos`/`eos=None` — otherwise a pad-eos consistency assertion trips.
- Phase-6 must NOT clear `_calibrator` on quantized modules.
- Per-actor exports omit `input_scale` keys; vLLM produces garbage decoding unless `input_scale=1.0` is injected per `.weight_scale_2` key.
- Merged `config.json` needs `input_activations.dynamic: true` (modelopt writes false but emits no static scale).
- Merged config must restore `num_hidden_layers`, `vocab_size`, and pad/bos/eos token IDs from the source.
(Plus three N-shard-specific fixes for the 3-shard Behemoth release — not exercised here since DeepSeek is 2-shard.)
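For reference, the `input_scale` injection (fourth item above) boils down to adding a scalar next to every `.weight_scale_2` entry. The sketch below is a hedged illustration of the idea, not the pipeline's actual code; the real pipeline also records the new keys in `model.safetensors.index.json`.

```python
# Conceptual sketch of the input_scale injection, not the pipeline's actual code:
# for every <prefix>.weight_scale_2 key, add <prefix>.input_scale = 1.0 so vLLM's
# compressed-tensors loader finds the activation scale it expects.
import torch
from safetensors.torch import load_file, save_file

def inject_input_scales(shard_path: str) -> None:
    tensors = load_file(shard_path)
    for key in list(tensors.keys()):
        if key.endswith(".weight_scale_2"):
            scale_key = key.replace(".weight_scale_2", ".input_scale")
            if scale_key not in tensors:
                tensors[scale_key] = torch.tensor(1.0, dtype=torch.float32)
    save_file(tensors, shard_path)

# inject_input_scales("model-00001-of-00008.safetensors")  # repeat for each shard
```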
Acknowledgments
- DeepSeek-AI for the original R1-Distill-Llama-70B
- Avarok-Cybersecurity (`tbraun96`) for the MARLIN-backend NVFP4 GEMM port — drives the ~+22 % decode speedup on Spark
- entrpi / antirez for the parallel hybrid-quant work on the MoE side of the Spark ecosystem (DeepSeek-V4-Flash) — different recipe, same Spark constraints
- saricles for setting the bar on GB10-tuned NVFP4 calibration recipes
- NVIDIA for the DGX Spark / GB10 platform, the NVFP4 format, and modelopt
- vLLM project for compressed-tensors NVFP4 inference support
License
MIT, inherited from deepseek-ai/DeepSeek-R1-Distill-Llama-70B. Pipeline code under Apache 2.0 at github.com/KaletoAI/distrib-nvfp4.
Status
Single-author release. Issues + feedback welcome — both on the model artifact and on the pipeline that built it.