Instructions to use Kaleto/Anubis-Pro-105B-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Kaleto/Anubis-Pro-105B-NVFP4 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Kaleto/Anubis-Pro-105B-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Kaleto/Anubis-Pro-105B-NVFP4")
model = AutoModelForCausalLM.from_pretrained("Kaleto/Anubis-Pro-105B-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Kaleto/Anubis-Pro-105B-NVFP4 with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Kaleto/Anubis-Pro-105B-NVFP4"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Kaleto/Anubis-Pro-105B-NVFP4",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
- SGLang
How to use Kaleto/Anubis-Pro-105B-NVFP4 with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Kaleto/Anubis-Pro-105B-NVFP4" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Kaleto/Anubis-Pro-105B-NVFP4",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```

Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "Kaleto/Anubis-Pro-105B-NVFP4" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Kaleto/Anubis-Pro-105B-NVFP4",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```

- Docker Model Runner
How to use Kaleto/Anubis-Pro-105B-NVFP4 with Docker Model Runner:
```bash
docker model run hf.co/Kaleto/Anubis-Pro-105B-NVFP4
```
# Anubis-Pro-105B-v1 — NVFP4 (compressed-tensors)
Built with Llama.
NVFP4 (4-bit floating-point, W4A4, group_size=16) quantization of TheDrummer/Anubis-Pro-105B-v1, produced via a custom 2-node distributed pipeline on NVIDIA DGX Spark (GB10) hardware.
This is one of the first publicly released large-model NVFP4 quantizations to come out of the DGX Spark personal-AI ecosystem. The goal of publishing it (and the pipeline below) is to lower the bar for other Spark owners and Blackwell-class GPU users to do the same with their own favorite models.
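A quick back-of-envelope check on the size figures below: NVFP4 stores each weight as 4 bits of FP4 plus one shared FP8 scale per 16-element group, i.e. roughly 4 + 8/16 = 4.5 bits per weight. For ~105B parameters that works out to about 105e9 × 4.5 / 8 bytes ≈ 59 GB, in line with the ~58 GB shard total (the unquantized BF16 lm_head and embeddings, plus per-tensor scales, shift the exact figure slightly).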
## Quick facts

| | |
|---|---|
| Base model | TheDrummer/Anubis-Pro-105B-v1 (Llama-3.3-70B upscaled to 105B + finetuned) |
| Architecture | LlamaForCausalLM, 120 layers, hidden_size=8192, 64 attn heads, 8 KV heads, head_dim=128 |
| Original size | ~196 GB (BF16) |
| Quantized size | ~58 GB (see Files tab) |
| Quant format | NVFP4 via nvidia-modelopt 0.43.0 |
| Storage layout | compressed-tensors (vLLM-native) |
| lm_head | Kept BF16 (unquantized), listed in `quantization_config.ignore` |
| KV cache | Configurable at serve time (FP8 recommended) |
| Calibration data | 256 samples from cnn_dailymail, lengths 150–1200 tokens |
| Conversion date | 2026-05-13 |
## Why this exists
Quantizing 100B+ class models for the new NVIDIA DGX Spark workstation is not as turn-key as it sounds. The standard single-node modelopt hf_ptq.py workflow silently fails on GB10's 128 GB unified memory (the accelerate library misdetects unified memory as a 5.2 TB GPU and triggers an OOM-kill during shard loading). Patching it to work via --low_memory_mode is also a known dead end — calibration "completes" but produces NaN block-scales for any model above ~70B class.
This release is the first Anubis-Pro NVFP4 that actually has clean, non-NaN block scales across all 840 weight quantizers (420 per shard), and that's because it uses a distributed two-node pipeline that sidesteps the unified-memory pitfalls entirely.
If you have a DGX Spark and you want to do this yourself, the pipeline is open-source at github.com/KaletoAI/distrib-nvfp4 (Apache 2.0). It's model-agnostic (Llama, Mistral-Large), has resume-from-checkpoint, and ships with a 1-layer smoke test so you can validate it on a 7B before committing to a 100B run.
## The hardware: NVIDIA DGX Spark (GB10)
If you don't know the Spark yet — it's NVIDIA's compact personal-AI workstation released in early 2026. Each unit is roughly the size of a Mac mini, runs on ~140 W at the wall under heavy load, and has:
- GB10 superchip: Grace ARM CPU + Blackwell GPU on the same package (sm_121)
- 128 GB LPDDR5X unified memory shared between CPU and GPU, ~900 GB/s aggregate bandwidth
- ConnectX-7 200 Gbit/s for cluster scaling
- ~1 PFLOP FP4 compute
A single Spark serves 30B-class models comfortably and 70B-class with FP8/NVFP4 quantization. Two Sparks in a small cluster (256 GB combined UMA) open the door to 100B–130B-class models like this one, and to local quantization of models in that range. This is currently one of the most practical ways to run frontier-size open-weight models from a power outlet under a desk.
This model was produced on, and is intended to run on, exactly that setup.
### Cluster used for this conversion
- 2× DGX Spark, each GB10 + 128 GB UMA = 256 GB combined
- ConnectX-7 200 GbE backbone, measured 44 GB/s effective NCCL AllReduce over IB
- ~280 W total system draw at the wall under sustained load
- Distributed quantization via Ray, 60 layers per actor
## Quantization Pipeline (short version)
Each of the two Ray actors owns half the layers and materializes only its own weights via `init_empty_weights` + selective `set_module_tensor_to_device`. modelopt's `mtq.quantize(wrapper, NVFP4_DEFAULT_CFG, forward_loop=None)` inserts quantizers in calibration mode without running its own forward, so the driver can route hidden states between actors over Ray RPC for each calibration sample.
After 256 variable-length samples, calibration is finalized, then each actor streams its export via `export_hf_checkpoint` on a one-layer-at-a-time mini template (the only way to avoid OOM on a 128 GB UMA pool already holding 105 GB of model weights). The driver then merges the per-actor shards, renames the layer indices on the second half, copies the tokenizer files, and patches config.json to keep lm_head BF16 via the ignore list.
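To make the choreography concrete, here is a structural sketch of that calibration loop. It is illustrative only: `ShardActor`, `load_shard_tensors`, `MODEL_DIR`, and `calibration_samples` are hypothetical names, and a real Llama decoder layer also needs attention masks and rotary position embeddings threaded through each call; see the distrib-nvfp4 repo for the actual implementation.

```python
# Illustrative sketch only. ShardActor, load_shard_tensors, MODEL_DIR and
# calibration_samples are hypothetical names, not the distrib-nvfp4 API.
import ray
import torch
import modelopt.torch.quantization as mtq
from accelerate import init_empty_weights
from accelerate.utils import set_module_tensor_to_device
from transformers import AutoConfig, AutoModelForCausalLM

@ray.remote(num_gpus=1)
class ShardActor:
    def __init__(self, model_dir, layer_range):
        cfg = AutoConfig.from_pretrained(model_dir)
        with init_empty_weights():               # skeleton on the meta device
            self.model = AutoModelForCausalLM.from_config(cfg)
        # Materialize only this actor's own layers from the checkpoint shards.
        for name, tensor in load_shard_tensors(model_dir, layer_range):
            set_module_tensor_to_device(self.model, name, "cuda", value=tensor)
        # forward_loop=None inserts NVFP4 quantizers in calibration mode but
        # leaves the forward passes to the driver instead of modelopt.
        mtq.quantize(self.model, mtq.NVFP4_DEFAULT_CFG, forward_loop=None)
        self.layer_range = layer_range

    @torch.no_grad()
    def embed(self, input_ids):
        return self.model.model.embed_tokens(input_ids.cuda()).cpu()

    @torch.no_grad()
    def run(self, hidden):
        # Real code also threads attention masks and rotary embeddings here.
        hidden = hidden.cuda()
        for i in self.layer_range:
            out = self.model.model.layers[i](hidden)
            hidden = out[0] if isinstance(out, tuple) else out
        return hidden.cpu()                      # shipped back over Ray RPC

# Driver: actor 0 owns layers 0-59 (+ embeddings), actor 1 owns 60-119.
a0 = ShardActor.remote(MODEL_DIR, range(0, 60))
a1 = ShardActor.remote(MODEL_DIR, range(60, 120))
for input_ids in calibration_samples:            # 256 cnn_dailymail samples
    h = ray.get(a0.embed.remote(input_ids))
    h = ray.get(a0.run.remote(h))                # first half sees activations
    ray.get(a1.run.remote(h))                    # second half sees activations
```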
Calibration health-check passed cleanly on the run that produced this artifact:
- shard0 (layers 0–59 + embed): good=420, zero=0, nan=0
- shard1 (layers 60–119 + norm + lm_head): good=420, zero=0, nan=0
## Performance
Tested on a single DGX Spark (GB10) running vLLM with this NVFP4 model loaded.
### Stock vLLM (CUTLASS GEMM, default backend)

| Context length | Prompt processing | Token generation (per stream) | Memory used |
|---|---|---|---|
| 4,096 | ~340 tok/s | ~3.1 tok/s | ~109 GB |
| 16,384 | ~650 tok/s | ~2.9 tok/s | ~109 GB |
| 32,768 | ~850 tok/s | ~2.9 tok/s | ~109 GB |
Memory use is constant across context lengths because vLLM pre-allocates the KV-cache pool at startup (`--gpu-memory-utilization 0.85` of the 128 GB UMA → ~109 GB total: ~58 GB for weights plus a ~51 GB KV pool, rounded). Token-generation latency is essentially context-independent at ~340 ms inter-token (1 GB10 GPU, no tensor parallelism).
Per-stream decode rate stays at ~3 tok/s across all tested contexts. Aggregate throughput scales with concurrency — at 4K context with --max-concurrency 4, the server processes ~10.4 tok/s of output and ~167 tok/s combined (prompt + decode). At 16K with concurrency 2, aggregate output is ~4.1 tok/s; total throughput ~264 tok/s.
### Tuned: MARLIN-GEMM + FlashInfer (Avarok stack)
The community-converged Spark runtime stack uses MARLIN as the NVFP4 GEMM kernel instead of CUTLASS, and FlashInfer as the attention backend. Adding the three env vars and one flag below to the same vLLM 0.20.2 build:
```bash
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_TEST_FORCE_FP8_MARLIN=1
VLLM_MARLIN_USE_ATOMIC_ADD=1
```
plus --attention-backend flashinfer on the serve command, gives this on the same Spark and model:
| Context length | Token generation | Speedup vs stock |
|---|---|---|
| short (~50 tok) | 3.78 tok/s | +22 % |
| ~2.6 K | 3.14 tok/s | +1 % |
Measured over 5 sequential decode-only requests (200 tokens each); inter-run std-dev under 1 % (3.76–3.78 range). Bench script in the public pipeline repo (bench_v2.sh).
The speedup is concentrated at short context (decode is compute-bound; MARLIN's faster NVFP4 GEMM dominates). At long context (≥4K), decode becomes memory-bound on the KV cache and MARLIN's win shrinks because the bottleneck is bandwidth, not GEMM. Use the tuned env vars for short-prompt / interactive workloads where the speedup is real; long-context throughput is essentially the same as stock.
Cold load (vLLM startup, end-to-end first-request latency from disk): ~520 s (8:40) for the 58 GB shards, single Spark. First load includes MARLIN's per-kernel JIT compile — cached for subsequent loads.
Stock-bench config: --quantization compressed-tensors --kv-cache-dtype fp8 --max-num-seqs 4 --gpu-memory-utilization 0.85 with vLLM 0.20.2rc1.dev53+g01b9b5af6 and no runtime env-var tuning. See Avarok's blog post for background on the MARLIN port. The Avarok dgx-vllm Docker image bundles the same configuration for users who don't want to maintain a custom vLLM build.
Benchmark command (reproducible):
```bash
vllm bench serve \
  --backend openai-chat \
  --base-url http://127.0.0.1:9005 \
  --endpoint /v1/chat/completions \
  --model Anubis-Pro-105B-NVFP4 \
  --tokenizer /path/to/Anubis-Pro-105B-NVFP4 \
  --dataset-name random \
  --random-input-len <PROMPT_LEN> \
  --random-output-len 256 \
  --num-prompts <N> \
  --max-concurrency <C> \
  --seed 42
```
Tested triples (prompt_len, num_prompts, max_concurrency) were (3840, 4, 4), (16128, 4, 2), (32000, 2, 1). vLLM build: 0.20.2rc1.dev53+g01b9b5af6.
## Usage

### vLLM (direct)
For the tuned Spark stack (recommended on GB10 — see Performance section), prepend the three env vars and add --attention-backend flashinfer:
```bash
VLLM_NVFP4_GEMM_BACKEND=marlin \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
vllm serve /path/to/Anubis-Pro-105B-NVFP4 \
  --served-model-name Anubis-Pro-105B-NVFP4 \
  --attention-backend flashinfer \
  --quantization compressed-tensors \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --port 9005
```
Drop the env vars and --attention-backend flashinfer to fall back to stock vLLM behaviour (CUTLASS GEMM, vLLM's default attention pick — usually FlashInfer auto-selected on Blackwell anyway).
### llama-swap entry

```yaml
"Anubis-Pro-105B-NVFP4":
  proxy: "http://127.0.0.1:9005"
  ttl: 0
  checkEndpoint: "/health"
  env:
    - "VLLM_NVFP4_GEMM_BACKEND=marlin"
    - "VLLM_TEST_FORCE_FP8_MARLIN=1"
    - "VLLM_MARLIN_USE_ATOMIC_ADD=1"
  cmd: >-
    /home/<user>/vllm-env/bin/python3 -m vllm.entrypoints.openai.api_server
    --model /home/<user>/models/Anubis-Pro-105B-NVFP4
    --attention-backend flashinfer
    --served-model-name Anubis-Pro-105B-NVFP4
    --quantization compressed-tensors
    --dtype auto
    --kv-cache-dtype fp8
    --max-model-len 32768
    --max-num-seqs 4
    --gpu-memory-utilization 0.85
    --trust-remote-code
    --enable-chunked-prefill
    --enable-prefix-caching
    --port 9005
    --host 127.0.0.1
```
### Recommended sampling
From TheDrummer's original card and community testing:
- Chat template: Llama 3 (for RP and instruct) or Alpaca (for story adventure)
- Setting A (community favorite): temp 0.75, smoothing_factor 0.2, smoothing_curve 2, min-p 0.01, DRY (multiplier 4, allowed_length 1, base 3) — temp_last
- Setting B (alternative): temp 1.0, min-p 0.02 — pairs well with "Llamaception" prompt templates
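For OpenAI-compatible clients talking to the vLLM server from the Usage section, Setting B maps directly onto the API. Below is a minimal sketch (port 9005 assumed): vLLM accepts `min_p` through its `extra_body` sampling extension, while Setting A's smoothing and DRY samplers are frontend features (SillyTavern and similar), not vLLM server options.

```python
# Minimal sketch: Setting B (temp 1.0, min-p 0.02) against the local vLLM
# server from the Usage section. Assumes `pip install openai` and port 9005.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9005/v1", api_key="none")

resp = client.chat.completions.create(
    model="Anubis-Pro-105B-NVFP4",
    messages=[{"role": "user", "content": "Describe the city gates at dusk."}],
    temperature=1.0,
    max_tokens=200,
    extra_body={"min_p": 0.02},  # vLLM-specific sampling extension
)
print(resp.choices[0].message.content)
```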
### Quick test

```bash
curl http://localhost:9005/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Anubis-Pro-105B-NVFP4",
    "messages": [{"role":"user","content":"Hello, who are you?"}],
    "max_tokens": 100,
    "temperature": 0.75
  }'
```
## Limitations and caveats

- Blackwell required. NVFP4 Tensor Cores live on sm_100+ hardware (B100, B200, GB10, RTX 5090 family). On older GPUs vLLM either refuses to load or falls back to a slow software path. To verify the fast path on Spark, check the startup log for `Using AttentionBackendEnum.FLASHINFER backend.` (attention); if you see `marlin` or `W4A16` as the dense-matmul kernel, you're on a fallback. For MoE models (not us — we are dense Llama), look additionally for `Using 'MARLIN' NvFp4 MoE backend.`
- vLLM ≥ 0.20.2 required. The FlashInfer NVFP4 GEMM kernel (`vllm/model_executor/kernels/linear/nvfp4/flashinfer.py`) was added in the 0.20.2 release line. Older vLLM builds will silently fall back to Marlin or W4A16 — output may also be wrong, not just slow. This model was produced and verified against `0.20.2rc1.dev53+g01b9b5af6`. No vLLM source patches are required; everything needed is in the model files. The `input_scale` keys (sidecar file `model-input_scales.safetensors`) are necessary because modelopt 0.43's NVFP4 exporter omits them and vLLM's loader needs them to be present even though dynamic input quantization is used at runtime.
- Quality vs BF16. NVFP4 weight quantization introduces measurable but small loss. For creative writing and roleplay (this model's strong suit) it is barely noticeable in community testing. For arithmetic-heavy or strict instruction-following workloads, FP8 or BF16 variants may be preferable.
- Calibration domain. Calibrated on cnn_dailymail (news text). Re-calibrating with domain-specific data (code, RP transcripts, etc.) might marginally improve outcomes for those uses. The pipeline supports swapping the calibration set with one line of code (see the sketch after this list).
- EU users / multimodal: Anubis-Pro is text-only, so the Llama 3.3 EU-specific multimodal restriction does not apply.
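As a hypothetical illustration of that one-line swap (the real knob lives in the distrib-nvfp4 pipeline and may be named differently):

```python
# Hypothetical sketch of swapping the calibration set; variable names here
# are illustrative, not the pipeline's actual configuration point.
from datasets import load_dataset

# Default: generic news text (what this release was calibrated on).
calib_texts = load_dataset("cnn_dailymail", "3.0.0", split="train[:256]")["article"]

# Domain-matched alternative: replace just this line with your own corpus, e.g.
# calib_texts = load_dataset("<your-rp-dataset>", split="train[:256]")["text"]
```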
## Files in this repository

- `config.json` — model config with `quantization_config` block (note: `input_activations.dynamic: true` is required, see Recent Fixes)
- `hf_quant_config.json` — modelopt-style quant manifest
- `generation_config.json` — defaults
- `model-NNNNN-of-NNNNN.safetensors` + `model.safetensors.index.json` — weights (12 shards, ~58 GB total)
- `model-input_scales.safetensors` — small (84 KB) sidecar containing `input_scale = 1.0` for every quantized Linear. Required because modelopt 0.43's NVFP4 exporter omits these keys; vLLM's loader needs them present even though dynamic input quantization is used at runtime. See Recent Fixes fix #6.
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` — tokenizer (chat template is embedded in `tokenizer_config.json`)
- `.gitattributes` — LFS markers for `*.safetensors` and `tokenizer.json`
- `LICENSE` — Llama 3.3 Community License Agreement (full Meta text)
- `NOTICE` — required attribution
## Future work (v2 ideas, not in this release)

- Re-calibrate with an RP-domain dataset — current calibration is generic `cnn_dailymail` news text. Mixing in light-novel translations, RP transcripts, and fiction would better match the activation distributions the model sees at serving time. The calibration domain determines which weights get crushed by FP4 rounding; a domain-matched calibration is a nearly free accuracy win.
- Try `lm_head` also quantized — currently kept in BF16. Quantizing it would shave another ~3 tok/s on Spark at roughly ~1 % accuracy hit. A borderline tradeoff for an RP/storytelling model; worth measuring.
- GB10-tuned ignore list — saricles-style targeted retention of small / bandwidth-critical matrices in BF16. Less impactful for dense Llama than for MoE, but still a candidate optimization.
- AWQ-4bit companion — if anyone publishes (or asks for) an AWQ variant of Anubis-Pro-105B for comparison, having both formats in one author space helps the community calibrate "is NVFP4 worth it for me?" decisions.
## Acknowledgments
- TheDrummer (BeaverAI) for Anubis-Pro-105B-v1, the base model this is built from
- Meta for Llama 3.3, the original foundation model
- NVIDIA for the DGX Spark / GB10 platform, the NVFP4 format, and modelopt
- vLLM project for compressed-tensors NVFP4 inference support
- Avarok-Cybersecurity (`tbraun96`) for the MARLIN-backend port of NVFP4 GEMM and the `avarok/dgx-vllm-nvfp4-kernel` Docker image that made NVFP4 actually competitive on Spark — see their blog post. Use their runtime stack to get peak speed from this model.
- saricles for setting the state-of-the-art bar on GB10-tuned NVFP4 quants — their `MiniMax-M2.5-REAP-…-NVFP4-GB10`, `Qwen3-Coder-Next-NVFP4-GB10`, and the documented `quantize-nvfp4-gb10-agentic.py` recipe are the reference for what a GB10-specific quantization recipe looks like. This release uses the modelopt default and is not GB10-tuned in that sense; a future v2 might be.
- RedHatAI for `Qwen3.5-122B-A10B-NVFP4` and similar llmcompressor-based releases — the closest size analog to this model.
- lukealonso for active NVFP4 publishing.
- mradermacher and bartowski for setting community precedent on Anubis-Pro requants in other formats (GGUF).
## License & attribution
This model is a derivative of Llama 3.3 70B (via TheDrummer/Anubis-Pro-105B-v1) and is therefore distributed under the Llama 3.3 Community License Agreement.
- Full license text: see `LICENSE` in this repository
- Required attribution: see `NOTICE`
- Acceptable Use Policy: https://www.llama.com/llama3_3/use-policy
By downloading, using, or redistributing this model you agree to the terms of the Llama 3.3 Community License Agreement.
Llama 3.3 is licensed under the Llama 3.3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.
Built with Llama.