Instructions to use Kaleto/Llama-3.3-70B-Instruct-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Kaleto/Llama-3.3-70B-Instruct-NVFP4 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Kaleto/Llama-3.3-70B-Instruct-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Kaleto/Llama-3.3-70B-Instruct-NVFP4")
model = AutoModelForCausalLM.from_pretrained("Kaleto/Llama-3.3-70B-Instruct-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Kaleto/Llama-3.3-70B-Instruct-NVFP4 with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Kaleto/Llama-3.3-70B-Instruct-NVFP4"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Kaleto/Llama-3.3-70B-Instruct-NVFP4",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker

```shell
docker model run hf.co/Kaleto/Llama-3.3-70B-Instruct-NVFP4
```
- SGLang
How to use Kaleto/Llama-3.3-70B-Instruct-NVFP4 with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Kaleto/Llama-3.3-70B-Instruct-NVFP4" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Kaleto/Llama-3.3-70B-Instruct-NVFP4",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker images

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "Kaleto/Llama-3.3-70B-Instruct-NVFP4" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Kaleto/Llama-3.3-70B-Instruct-NVFP4",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

- Docker Model Runner
How to use Kaleto/Llama-3.3-70B-Instruct-NVFP4 with Docker Model Runner:
```shell
docker model run hf.co/Kaleto/Llama-3.3-70B-Instruct-NVFP4
```
Llama-3.3-70B-Instruct — NVFP4 (compressed-tensors)
Built with Llama.
NVFP4 (4-bit floating-point, W4A4, group_size=16) quantization of meta-llama/Llama-3.3-70B-Instruct, produced via a distributed 2-node pipeline on NVIDIA DGX Spark (GB10) hardware.
To my knowledge this is the first publicly available NVFP4 quantization of vanilla Llama-3.3-70B-Instruct — the highest-reach single Llama model in the 70B class, with ~11.6 M downloads on the original Meta base.
Quick facts
| | |
|---|---|
| Base model | meta-llama/Llama-3.3-70B-Instruct (Meta Llama 3.3, gated) |
| Architecture | LlamaForCausalLM, 80 layers, hidden_size=8192, 64 attn heads, 8 KV heads, head_dim=128 |
| Original size | ~141 GB (BF16) |
| Quantized size | ~40 GB (see Files tab) |
| Quant format | NVFP4 via nvidia-modelopt 0.43.0 |
| Storage layout | compressed-tensors (vLLM-native) |
| lm_head | Kept BF16 (unquantized), in quantization_config.ignore |
| KV cache | Configurable at serve time (FP8 recommended) |
| Calibration data | 256 samples from cnn_dailymail, lengths 150–1200 tokens |
| Conversion date | 2026-05-15 |
Why this exists
Vanilla Llama-3.3-70B-Instruct is Meta's flagship 70B instruct model — strong on instruction following, multilingual (8 languages), and the de-facto baseline most downstream finetunes start from. Despite 11.6 M downloads on the original base, no publicly available NVFP4 quantization existed before this release. This closes that gap.
NVFP4 is NVIDIA's hardware-accelerated 4-bit floating-point format introduced with Blackwell — natively supported by Spark/GB10, 5090, B100. Quality lands roughly in the Q5-Q6 GGUF range at Q4 size, with hardware-accelerated GEMM kernels making it faster than GGUF on Blackwell.
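For intuition only, here is a toy sketch of block-scaled 4-bit float quantization with group_size=16. It is not the modelopt/NVFP4 implementation (real NVFP4 additionally stores the per-group scales in FP8 plus a global FP32 scale); it just shows the basic idea of snapping each group of 16 weights onto the E2M1 grid under a shared scale:

```python
# Toy illustration of block-scaled FP4 (E2M1) weight quantization, group_size=16.
# Simplified sketch for intuition only -- NOT the NVFP4/modelopt code path.
import numpy as np

E2M1_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable FP4 magnitudes

def quantize_group(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one group of 16 weights to FP4 values plus a shared scale."""
    amax = float(np.abs(w).max())
    scale = amax / 6.0 if amax > 0 else 1.0        # map the largest weight onto the max FP4 level
    scaled = w / scale
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_LEVELS[None, :]).argmin(axis=1)
    q = np.sign(scaled) * E2M1_LEVELS[idx]          # nearest representable magnitude, sign restored
    return q, scale

def dequantize_group(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=16).astype(np.float32)
q, s = quantize_group(w)
print("max abs error:", np.abs(w - dequantize_group(q, s)).max())
```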
Pipeline source: github.com/KaletoAI/distrib-nvfp4 (Apache 2.0). Same toolchain that produced Anubis-Pro-105B-NVFP4, Behemoth-X-123B-v2.2-NVFP4, and DeepSeek-R1-Distill-Llama-70B-NVFP4.
Quantization Pipeline (short version)
Two Ray actors own 40 layers each on a 2-Spark cluster (ConnectX-7 IB backbone). modelopt's mtq.quantize(wrapper, NVFP4_DEFAULT_CFG, forward_loop=None) inserts the W4A4 quantizers in calibration mode; the driver routes hidden states between actors via Ray RPC for each of 256 calibration samples.
After finalize, each actor evicts its shard to disk (using cloudpickle as pickle_module — modelopt 0.43's QuantLinear is a dynamically generated subclass that vanilla pickle can't serialize), then streams a per-layer NVFP4 export via mte.export_hf_checkpoint on a 1-layer template (with use_cache=False and layer.self_attn.layer_idx reset to 0 to dodge a transformers DynamicCache shape mismatch). The driver merges the per-actor shards, renames the layer indices on shard 1 with a +40 offset, copies the tokenizer files, and patches config.json to keep lm_head in BF16 and to inject input_scale=1.0 for every weight quantizer (modelopt 0.43 omits these, but vLLM's loader requires them).
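For orientation, here is a hedged single-node sketch of the same modelopt flow: a plain forward_loop over the calibration set instead of the 2-actor Ray routing, and the stock export path rather than the streaming per-layer variant. Dataset config and field names are assumptions.

```python
# Hedged single-node sketch of the quantize + export flow described above.
# Assumes modelopt 0.43-style APIs (mtq.quantize, export_hf_checkpoint) and the
# cnn_dailymail "article" field for calibration; the real pipeline splits the
# 80 layers across two Ray actors and routes hidden states between them via RPC.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model_id = "meta-llama/Llama-3.3-70B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 256 calibration samples, 150-1200 tokens each (see Quick facts).
calib_texts = load_dataset("cnn_dailymail", "3.0.0", split="train[:256]")["article"]

def forward_loop(m):
    # Run the calibration set through the model so the inserted W4A4 quantizers
    # can collect amax statistics.
    for text in calib_texts:
        ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1200).input_ids
        with torch.no_grad():
            m(ids.to(m.device))

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop)
export_hf_checkpoint(model, export_dir="Llama-3.3-70B-Instruct-NVFP4")
```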
Calibration health on the run that produced this artifact: clean (no NaN, no zero quantizers, all 560 Linears per shard quantized).
Total pipeline time: ~25 min on 2× DGX Spark IB-cluster.
Performance
vLLM bench numbers will follow in a separate update; the pattern should be consistent with the sibling Anubis-Pro-105B and Behemoth-X-123B releases.
Reference numbers from related releases on the same hardware (single Spark, vLLM 0.20.2rc1, Avarok-stack env vars):
| Model | Decode tok/s | Cold load |
|---|---|---|
| Anubis-Pro-105B-NVFP4 (Llama-3.3 105B) | 3.78 tok/s short ctx | ~520 s |
| Behemoth-X-123B-NVFP4 (Mistral-Large 123B) | 3.21 tok/s short ctx | ~430 s |
| Llama-3.3-70B-Instruct-NVFP4 (this) | expected ~4.5-5.5 tok/s (smaller model) | ~280-350 s |
A 70B-class model on a 128 GB Spark UMA leaves a generous KV-cache pool — it should comfortably serve 32 K context at --max-num-seqs 4.
Usage
vLLM (direct)
Recommended on GB10 — the tuned Spark stack:
```shell
VLLM_NVFP4_GEMM_BACKEND=marlin \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
vllm serve /path/to/Llama-3.3-70B-Instruct-NVFP4 \
  --served-model-name Llama-3.3-70B-Instruct-NVFP4 \
  --attention-backend flashinfer \
  --quantization compressed-tensors \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.80 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --port 9008
```
`--gpu-memory-utilization 0.80` for the ~40 GB Llama-3.3 NVFP4 leaves ~62 GB of KV-cache pool on a 128 GB UMA Spark (0.80 × 128 GB ≈ 102 GB budget, minus ~40 GB of weights) — generous for 32 K context. Bump to 0.85 for more concurrency.
llama-swap entry
"Llama-3.3-70B-Instruct-NVFP4":
proxy: "http://127.0.0.1:9008"
ttl: 0
checkEndpoint: "/health"
env:
- "VLLM_NVFP4_GEMM_BACKEND=marlin"
- "VLLM_TEST_FORCE_FP8_MARLIN=1"
- "VLLM_MARLIN_USE_ATOMIC_ADD=1"
cmd: >-
/home/<user>/vllm-env/bin/python3 -m vllm.entrypoints.openai.api_server
--model /home/<user>/models/Llama-3.3-70B-Instruct-NVFP4
--attention-backend flashinfer
--served-model-name Llama-3.3-70B-Instruct-NVFP4
--quantization compressed-tensors
--dtype auto
--kv-cache-dtype fp8
--max-model-len 32768
--max-num-seqs 4
--gpu-memory-utilization 0.80
--trust-remote-code
--enable-chunked-prefill
--enable-prefix-caching
--port 9008
--host 127.0.0.1
Recommended sampling
Llama-3.3-Instruct uses the standard Llama 3 chat template with system / user / assistant roles. Default sampling that works well:
- temperature: 0.6 - 0.7
- top_p: 0.9
- min_p: 0.05
- repetition_penalty: 1.0 (don't add one — Llama-3.3 doesn't need it)
- System prompt: use one. Llama-3.3-Instruct is heavily system-prompt-tuned.
For tool use / function calling: Llama-3.3-Instruct supports the standard <|tool_call|>...<|/tool_call|> flow. The quantization preserves this behaviour.
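A minimal client sketch using these settings against the vLLM server from the Usage section below (assumes the OpenAI Python client, the port 9008 / served model name used there, and vLLM's min_p sampling extension):

```python
# Minimal chat request with the recommended sampling, against the vLLM
# OpenAI-compatible endpoint started in the Usage section (port 9008 assumed).
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:9008/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct-NVFP4",
    messages=[
        {"role": "system", "content": "You are a concise, helpful assistant."},
        {"role": "user", "content": "Summarize NVFP4 in two sentences."},
    ],
    temperature=0.6,
    top_p=0.9,
    extra_body={"min_p": 0.05},  # min_p is a vLLM extension, passed via extra_body
    max_tokens=256,
)
print(resp.choices[0].message.content)
```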
Files in this repository
- `model-NNNNN-of-00008.safetensors` — 8 shards, NVFP4-packed weights + scales (~40 GB total)
- `model.safetensors.index.json` — weight map (~2 403 keys: 80 layers × 7 quant linears × 4 keys + norms + embed + lm_head + injected `input_scale`)
- `config.json` — Llama config with `quantization_config.ignore=["lm_head"]` and `input_activations.dynamic: true`
- `hf_quant_config.json`, `generation_config.json` — auxiliary configs
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` — Llama-3.3 tokenizer (tiktoken-style, no `tokenizer.model`)
Recent fixes baked into the conversion
modelopt 0.43's NVFP4 export has six gotchas that must be worked around before vLLM will serve the output without producing garbage. All are applied automatically by the pipeline (a sketch of the scale/config patching follows the list below):
- Phase-6 1-layer template needs `vocab_size=2` (not 1) because modelopt's `llm_dummy_forward` feeds `torch.ones([1, 2])`.
- Phase-6 template needs `pad_token_id=None` / `bos`/`eos=None` — the pad/eos consistency assertion fires otherwise.
- Phase-6 must NOT clear `_calibrator` on quantized modules.
- Per-actor exports omit `input_scale` keys; vLLM produces garbage decoding unless `input_scale=1.0` is injected per `.weight_scale_2` key.
- Merged `config.json` needs `input_activations.dynamic: true` (modelopt writes false but emits no static scale).
- Merged config must restore `num_hidden_layers`, `vocab_size`, and the pad/bos/eos token IDs from the source model.
(Three additional N-shard-specific fixes are documented in the Behemoth-X-123B model card — not exercised here since Llama-3.3 fits comfortably in a 2-shard split.)
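For illustration, a hedged sketch of the scale injection and config patching described in the last three items. Key and field names follow this card and are not verified against the actual distrib-nvfp4 code:

```python
# Hedged sketch of the post-export patching: inject input_scale=1.0 next to every
# weight quantizer and fix config.json so vLLM's compressed-tensors loader accepts
# the checkpoint. Key/field names follow this card; the actual implementation lives
# in github.com/KaletoAI/distrib-nvfp4.
import json
from pathlib import Path

import torch
from safetensors.torch import load_file, save_file

ckpt = Path("Llama-3.3-70B-Instruct-NVFP4")
src_cfg = json.loads(Path("Llama-3.3-70B-Instruct/config.json").read_text())  # source model config

# 1. Inject a unit input_scale for every .weight_scale_2 key (modelopt 0.43 omits them).
#    (model.safetensors.index.json must also list the injected keys -- omitted here.)
for shard in sorted(ckpt.glob("model-*-of-*.safetensors")):
    tensors = load_file(shard)
    for key in [k for k in tensors if k.endswith(".weight_scale_2")]:
        tensors.setdefault(key.replace(".weight_scale_2", ".input_scale"),
                           torch.tensor(1.0, dtype=torch.float32))
    save_file(tensors, shard)

# 2. Patch config.json: keep lm_head unquantized, mark input activations dynamic,
#    and restore the fields the 1-layer export template overwrote.
cfg_path = ckpt / "config.json"
cfg = json.loads(cfg_path.read_text())
qc = cfg["quantization_config"]
qc["ignore"] = ["lm_head"]
for group in qc.get("config_groups", {}).values():
    if group.get("input_activations"):
        group["input_activations"]["dynamic"] = True
for field in ("num_hidden_layers", "vocab_size", "pad_token_id", "bos_token_id", "eos_token_id"):
    cfg[field] = src_cfg.get(field)
cfg_path.write_text(json.dumps(cfg, indent=2))
```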
Acknowledgments
- Meta for the original Llama 3.3 base model and the Community License
- Avarok-Cybersecurity (tbraun96) for the MARLIN-backend NVFP4 GEMM port — drives the ~+22 % decode speedup on Spark
- saricles for setting the bar on GB10-tuned NVFP4 calibration recipes
- NVIDIA for the DGX Spark / GB10 platform, the NVFP4 format, and modelopt
- vLLM project for compressed-tensors NVFP4 inference support
License
Llama 3.3 Community License, inherited from the base model meta-llama/Llama-3.3-70B-Instruct. Some restrictions apply (commercial use above 700 M monthly active users, attribution requirements). Pipeline code under Apache 2.0 at github.com/KaletoAI/distrib-nvfp4.
Full Llama 3.3 license text in the LICENSE file accompanying the base model.
Status
Single-author release. Issues + feedback welcome — both on the model artifact and on the pipeline that built it.