Instructions to use Kaleto/Anubis-Pro-105B-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Kaleto/Anubis-Pro-105B-NVFP4 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Kaleto/Anubis-Pro-105B-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Kaleto/Anubis-Pro-105B-NVFP4")
model = AutoModelForCausalLM.from_pretrained("Kaleto/Anubis-Pro-105B-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Kaleto/Anubis-Pro-105B-NVFP4 with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Kaleto/Anubis-Pro-105B-NVFP4"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Kaleto/Anubis-Pro-105B-NVFP4",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```
- SGLang
How to use Kaleto/Anubis-Pro-105B-NVFP4 with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Kaleto/Anubis-Pro-105B-NVFP4" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Kaleto/Anubis-Pro-105B-NVFP4",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```

Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "Kaleto/Anubis-Pro-105B-NVFP4" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Kaleto/Anubis-Pro-105B-NVFP4",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
```

- Docker Model Runner
How to use Kaleto/Anubis-Pro-105B-NVFP4 with Docker Model Runner:
```bash
docker model run hf.co/Kaleto/Anubis-Pro-105B-NVFP4
```
# Anubis-Pro-105B-v1 — NVFP4 (compressed-tensors)
Built with Llama.
NVFP4 (4-bit floating-point, W4A4, group_size=16) quantization of TheDrummer/Anubis-Pro-105B-v1, produced via a custom 2-node distributed pipeline on NVIDIA DGX Spark (GB10) hardware.
This is one of the first publicly released large-model NVFP4 quantizations to come out of the DGX Spark personal-AI ecosystem. The goal of publishing it (and the pipeline below) is to lower the bar for other Spark owners and Blackwell-class GPU users to do the same with their own favorite models.
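A quick back-of-envelope check on the size figures below: NVFP4 stores each weight as 4 bits of FP4 plus one shared FP8 scale per 16-element group, i.e. roughly 4 + 8/16 = 4.5 bits per weight. For ~105B parameters that works out to about 105e9 × 4.5 / 8 bytes ≈ 59 GB, in line with the ~58 GB shard total (the unquantized BF16 lm_head and embeddings, plus per-tensor scales, shift the exact figure slightly).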
## Quick facts

| | |
|---|---|
| Base model | TheDrummer/Anubis-Pro-105B-v1 (Llama-3.3-70B upscaled to 105B + finetuned) |
| Architecture | LlamaForCausalLM, 120 layers, hidden_size=8192, 64 attn heads, 8 KV heads, head_dim=128 |
| Original size | ~196 GB (BF16) |
| Quantized size | ~58 GB (see Files tab) |
| Quant format | NVFP4 via nvidia-modelopt 0.43.0 |
| Storage layout | compressed-tensors (vLLM-native) |
| lm_head | Kept BF16 (unquantized), listed in `quantization_config.ignore` |
| KV cache | Configurable at serve time (FP8 recommended) |
| Calibration data | 256 samples from cnn_dailymail, lengths 150–1200 tokens |
| Conversion date | 2026-05-13 |
## Why this exists
Quantizing 100B+ class models for the new NVIDIA DGX Spark workstation is not as turn-key as it sounds. The standard single-node modelopt hf_ptq.py workflow silently fails on GB10's 128 GB unified memory (the accelerate library misdetects unified memory as a 5.2 TB GPU and triggers an OOM-kill during shard loading). Patching it to work via --low_memory_mode is also a known dead end — calibration "completes" but produces NaN block-scales for any model above ~70B class.
This release is the first Anubis-Pro NVFP4 that actually has clean, non-NaN block scales across all 840 weight quantizers (420 per shard), and that's because it uses a distributed two-node pipeline that sidesteps the unified-memory pitfalls entirely.
If you have a DGX Spark and you want to do this yourself, the pipeline is open-source at github.com/KaletoAI/distrib-nvfp4 (Apache 2.0). It's model-agnostic (Llama, Mistral-Large), has resume-from-checkpoint, and ships with a 1-layer smoke test so you can validate it on a 7B before committing to a 100B run.
## The hardware: NVIDIA DGX Spark (GB10)
If you don't know the Spark yet — it's NVIDIA's compact personal-AI workstation released in early 2026. Each unit is roughly the size of a Mac mini, runs on ~140 W at the wall under heavy load, and has:
- GB10 superchip: Grace ARM CPU + Blackwell GPU on the same package (sm_121)
- 128 GB LPDDR5X unified memory shared between CPU and GPU, ~900 GB/s aggregate bandwidth
- ConnectX-7 200 Gbit/s for cluster scaling
- ~1 PFLOP FP4 compute
A single Spark serves 30B-class models comfortably and 70B-class with FP8/NVFP4 quantization. Two Sparks in a small cluster (256 GB combined UMA) open the door to 100B–130B-class models like this one, and to local quantization of models in that range. This is currently one of the most practical ways to run frontier-size open-weight models from a power outlet under a desk.
This model was produced on, and is intended to run on, exactly that setup.
### Cluster used for this conversion
- 2× DGX Spark, each GB10 + 128 GB UMA = 256 GB combined
- ConnectX-7 200 GbE backbone, measured 44 GB/s effective NCCL AllReduce over IB
- ~280 W total system draw at the wall under sustained load
- Distributed quantization via Ray, 60 layers per actor
## Quantization Pipeline (short version)
Each of the two Ray actors owns half the layers and materializes only its own weights via `init_empty_weights` + selective `set_module_tensor_to_device`. modelopt's `mtq.quantize(wrapper, NVFP4_DEFAULT_CFG, forward_loop=None)` inserts quantizers in calibration mode without running its own forward, so the driver can route hidden states between actors over Ray RPC for each calibration sample.
After 256 variable-length samples, calibration is finalized, then each actor streams its export via `export_hf_checkpoint` on a one-layer-at-a-time mini template (the only way to avoid OOM on a 128 GB UMA pool already holding 105 GB of model weights). The driver then merges the per-actor shards, renames the layer indices on the second half, copies the tokenizer files, and patches config.json to keep lm_head BF16 via the ignore list.
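To make the choreography concrete, here is a structural sketch of that calibration loop. It is illustrative only: `ShardActor`, `load_shard_tensors`, `MODEL_DIR`, and `calibration_samples` are hypothetical names, and a real Llama decoder layer also needs attention masks and rotary position embeddings threaded through each call; see the distrib-nvfp4 repo for the actual implementation.

```python
# Illustrative sketch only. ShardActor, load_shard_tensors, MODEL_DIR and
# calibration_samples are hypothetical names, not the distrib-nvfp4 API.
import ray
import torch
import modelopt.torch.quantization as mtq
from accelerate import init_empty_weights
from accelerate.utils import set_module_tensor_to_device
from transformers import AutoConfig, AutoModelForCausalLM

@ray.remote(num_gpus=1)
class ShardActor:
    def __init__(self, model_dir, layer_range):
        cfg = AutoConfig.from_pretrained(model_dir)
        with init_empty_weights():               # skeleton on the meta device
            self.model = AutoModelForCausalLM.from_config(cfg)
        # Materialize only this actor's own layers from the checkpoint shards.
        for name, tensor in load_shard_tensors(model_dir, layer_range):
            set_module_tensor_to_device(self.model, name, "cuda", value=tensor)
        # forward_loop=None inserts NVFP4 quantizers in calibration mode but
        # leaves the forward passes to the driver instead of modelopt.
        mtq.quantize(self.model, mtq.NVFP4_DEFAULT_CFG, forward_loop=None)
        self.layer_range = layer_range

    @torch.no_grad()
    def embed(self, input_ids):
        return self.model.model.embed_tokens(input_ids.cuda()).cpu()

    @torch.no_grad()
    def run(self, hidden):
        # Real code also threads attention masks and rotary embeddings here.
        hidden = hidden.cuda()
        for i in self.layer_range:
            out = self.model.model.layers[i](hidden)
            hidden = out[0] if isinstance(out, tuple) else out
        return hidden.cpu()                      # shipped back over Ray RPC

# Driver: actor 0 owns layers 0-59 (+ embeddings), actor 1 owns 60-119.
a0 = ShardActor.remote(MODEL_DIR, range(0, 60))
a1 = ShardActor.remote(MODEL_DIR, range(60, 120))
for input_ids in calibration_samples:            # 256 cnn_dailymail samples
    h = ray.get(a0.embed.remote(input_ids))
    h = ray.get(a0.run.remote(h))                # first half sees activations
    ray.get(a1.run.remote(h))                    # second half sees activations
```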
Calibration health-check passed cleanly on the run that produced this artifact:
- shard0 (layers 0–59 + embed): good=420, zero=0, nan=0
- shard1 (layers 60–119 + norm + lm_head): good=420, zero=0, nan=0
## Performance
Tested on a single DGX Spark (GB10) running vLLM with this NVFP4 model loaded.
### Stock vLLM (CUTLASS GEMM, default backend)

| Context length | Prompt processing | Token generation (per stream) | Memory used |
|---|---|---|---|
| 4,096 | ~340 tok/s | ~3.1 tok/s | ~109 GB |
| 16,384 | ~650 tok/s | ~2.9 tok/s | ~109 GB |
| 32,768 | ~850 tok/s | ~2.9 tok/s | ~109 GB |
Memory use is constant across context lengths because vLLM pre-allocates the KV-cache pool at startup (`--gpu-memory-utilization 0.85` of the 128 GB UMA → ~109 GB total: ~58 GB for weights plus a ~51 GB KV pool, rounded). Token-generation latency is essentially context-independent at ~340 ms inter-token (1 GB10 GPU, no tensor parallelism).
Per-stream decode rate stays at ~3 tok/s across all tested contexts. Aggregate throughput scales with concurrency — at 4K context with --max-concurrency 4, the server processes ~10.4 tok/s of output and ~167 tok/s combined (prompt + decode). At 16K with concurrency 2, aggregate output is ~4.1 tok/s; total throughput ~264 tok/s.
### Tuned: MARLIN-GEMM + FlashInfer (Avarok stack)
The community-converged Spark runtime stack uses MARLIN as the NVFP4 GEMM kernel instead of CUTLASS, and FlashInfer as the attention backend. Adding the three env vars and one flag below to the same vLLM 0.20.2 build:
```bash
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_TEST_FORCE_FP8_MARLIN=1
VLLM_MARLIN_USE_ATOMIC_ADD=1
```
plus --attention-backend flashinfer on the serve command, gives this on the same Spark and model:
| Context length | Token generation | Speedup vs stock |
|---|---|---|
| short (~50 tok) | 3.78 tok/s | +22 % |
| ~2.6 K | 3.14 tok/s | +1 % |
Measured over 5 sequential decode-only requests (200 tokens each); inter-run std-dev under 1 % (3.76–3.78 range). Bench script in the public pipeline repo (bench_v2.sh).
The speedup is concentrated at short context (decode is compute-bound; MARLIN's faster NVFP4 GEMM dominates). At long context (≥4K), decode becomes memory-bound on the KV cache and MARLIN's win shrinks because the bottleneck is bandwidth, not GEMM. Use the tuned env vars for short-prompt / interactive workloads where the speedup is real; long-context throughput is essentially the same as stock.
Cold load (vLLM startup, end-to-end first-request latency from disk): ~520 s (8:40) for the 58 GB shards, single Spark. First load includes MARLIN's per-kernel JIT compile — cached for subsequent loads.
Stock-bench config: --quantization compressed-tensors --kv-cache-dtype fp8 --max-num-seqs 4 --gpu-memory-utilization 0.85 with vLLM 0.20.2rc1.dev53+g01b9b5af6 and no runtime env-var tuning. See Avarok's blog post for background on the MARLIN port. The Avarok dgx-vllm Docker image bundles the same configuration for users who don't want to maintain a custom vLLM build.
Benchmark command (reproducible):
```bash
vllm bench serve \
  --backend openai-chat \
  --base-url http://127.0.0.1:9005 \
  --endpoint /v1/chat/completions \
  --model Anubis-Pro-105B-NVFP4 \
  --tokenizer /path/to/Anubis-Pro-105B-NVFP4 \
  --dataset-name random \
  --random-input-len <PROMPT_LEN> \
  --random-output-len 256 \
  --num-prompts <N> \
  --max-concurrency <C> \
  --seed 42
```
Tested triples (prompt_len, num_prompts, max_concurrency) were (3840, 4, 4), (16128, 4, 2), (32000, 2, 1). vLLM build: 0.20.2rc1.dev53+g01b9b5af6.
## Usage

### vLLM (direct)
For the tuned Spark stack (recommended on GB10 — see Performance section), prepend the three env vars and add --attention-backend flashinfer:
```bash
VLLM_NVFP4_GEMM_BACKEND=marlin \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
VLLM_MARLIN_USE_ATOMIC_ADD=1 \
vllm serve /path/to/Anubis-Pro-105B-NVFP4 \
  --served-model-name Anubis-Pro-105B-NVFP4 \
  --attention-backend flashinfer \
  --quantization compressed-tensors \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.85 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --port 9005
```
Drop the env vars and --attention-backend flashinfer to fall back to stock vLLM behaviour (CUTLASS GEMM, vLLM's default attention pick — usually FlashInfer auto-selected on Blackwell anyway).
### llama-swap entry

```yaml
"Anubis-Pro-105B-NVFP4":
  proxy: "http://127.0.0.1:9005"
  ttl: 0
  checkEndpoint: "/health"
  env:
    - "VLLM_NVFP4_GEMM_BACKEND=marlin"
    - "VLLM_TEST_FORCE_FP8_MARLIN=1"
    - "VLLM_MARLIN_USE_ATOMIC_ADD=1"
  cmd: >-
    /home/<user>/vllm-env/bin/python3 -m vllm.entrypoints.openai.api_server
    --model /home/<user>/models/Anubis-Pro-105B-NVFP4
    --attention-backend flashinfer
    --served-model-name Anubis-Pro-105B-NVFP4
    --quantization compressed-tensors
    --dtype auto
    --kv-cache-dtype fp8
    --max-model-len 32768
    --max-num-seqs 4
    --gpu-memory-utilization 0.85
    --trust-remote-code
    --enable-chunked-prefill
    --enable-prefix-caching
    --port 9005
    --host 127.0.0.1
```
### Recommended sampling
From TheDrummer's original card and community testing:
- Chat template: Llama 3 (for RP and instruct) or Alpaca (for story adventure)
- Setting A (community favorite): temp 0.75, smoothing_factor 0.2, smoothing_curve 2, min-p 0.01, DRY (multiplier 4, allowed_length 1, base 3) — temp_last
- Setting B (alternative): temp 1.0, min-p 0.02 — pairs well with "Llamaception" prompt templates
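For OpenAI-compatible clients talking to the vLLM server from the Usage section, Setting B maps directly onto the API. Below is a minimal sketch (port 9005 assumed): vLLM accepts `min_p` through its `extra_body` sampling extension, while Setting A's smoothing and DRY samplers are frontend features (SillyTavern and similar), not vLLM server options.

```python
# Minimal sketch: Setting B (temp 1.0, min-p 0.02) against the local vLLM
# server from the Usage section. Assumes `pip install openai` and port 9005.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9005/v1", api_key="none")

resp = client.chat.completions.create(
    model="Anubis-Pro-105B-NVFP4",
    messages=[{"role": "user", "content": "Describe the city gates at dusk."}],
    temperature=1.0,
    max_tokens=200,
    extra_body={"min_p": 0.02},  # vLLM-specific sampling extension
)
print(resp.choices[0].message.content)
```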
### Quick test

```bash
curl http://localhost:9005/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Anubis-Pro-105B-NVFP4",
    "messages": [{"role":"user","content":"Hello, who are you?"}],
    "max_tokens": 100,
    "temperature": 0.75
  }'
```
## Limitations and caveats

- Blackwell required. NVFP4 Tensor Cores live on sm_100+ hardware (B100, B200, GB10, RTX 5090 family). On older GPUs vLLM either refuses to load or falls back to a slow software path. To verify the fast path on Spark, check the startup log for `Using AttentionBackendEnum.FLASHINFER backend.` (attention); if you see `marlin` or `W4A16` as the dense-matmul kernel, you're on a fallback. For MoE models (not us — we are dense Llama), look additionally for `Using 'MARLIN' NvFp4 MoE backend.`
- vLLM ≥ 0.20.2 required. The FlashInfer NVFP4 GEMM kernel (`vllm/model_executor/kernels/linear/nvfp4/flashinfer.py`) was added in the 0.20.2 release line. Older vLLM builds will silently fall back to Marlin or W4A16 — output may also be wrong, not just slow. This model was produced and verified against `0.20.2rc1.dev53+g01b9b5af6`. No vLLM source patches are required; everything needed is in the model files. The `input_scale` keys (sidecar file `model-input_scales.safetensors`) are necessary because modelopt 0.43's NVFP4 exporter omits them and vLLM's loader needs them to be present even though dynamic input quantization is used at runtime.
- Quality vs BF16. NVFP4 weight quantization introduces measurable but small loss. For creative writing and roleplay (this model's strong suit) it is barely noticeable in community testing. For arithmetic-heavy or strict instruction-following workloads, FP8 or BF16 variants may be preferable.
- Calibration domain. Calibrated on cnn_dailymail (news text). Re-calibrating with domain-specific data (code, RP transcripts, etc.) might marginally improve outcomes for those uses. The pipeline supports swapping the calibration set with one line of code (see the sketch after this list).
- EU users / multimodal: Anubis-Pro is text-only, so the Llama 3.3 EU-specific multimodal restriction does not apply.
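As a hypothetical illustration of that one-line swap (the real knob lives in the distrib-nvfp4 pipeline and may be named differently):

```python
# Hypothetical sketch of swapping the calibration set; variable names here
# are illustrative, not the pipeline's actual configuration point.
from datasets import load_dataset

# Default: generic news text (what this release was calibrated on).
calib_texts = load_dataset("cnn_dailymail", "3.0.0", split="train[:256]")["article"]

# Domain-matched alternative: replace just this line with your own corpus, e.g.
# calib_texts = load_dataset("<your-rp-dataset>", split="train[:256]")["text"]
```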
## Files in this repository

- `config.json` — model config with `quantization_config` block (note: `input_activations.dynamic: true` is required, see Recent Fixes)
- `hf_quant_config.json` — modelopt-style quant manifest
- `generation_config.json` — defaults
- `model-NNNNN-of-NNNNN.safetensors` + `model.safetensors.index.json` — weights (12 shards, ~58 GB total)
- `model-input_scales.safetensors` — small (84 KB) sidecar containing `input_scale = 1.0` for every quantized Linear. Required because modelopt 0.43's NVFP4 exporter omits these keys; vLLM's loader needs them present even though dynamic input quantization is used at runtime. See Recent Fixes fix #6.
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` — tokenizer (chat template is embedded in `tokenizer_config.json`)
- `.gitattributes` — LFS markers for `*.safetensors` and `tokenizer.json`
- `LICENSE` — Llama 3.3 Community License Agreement (full Meta text)
- `NOTICE` — required attribution
## Future work (v2 ideas, not in this release)

- Re-calibrate with an RP-domain dataset — current calibration is generic `cnn_dailymail` news text. Mixing in light-novel translations, RP transcripts, and fiction would better match the activation distributions the model sees at serving time. The calibration domain determines which weights get crushed by FP4 rounding; a domain-matched calibration is a nearly free accuracy win.
- Try `lm_head` also quantized — currently kept in BF16. Quantizing it would shave another ~3 tok/s on Spark at roughly ~1 % accuracy hit. A borderline tradeoff for an RP/storytelling model; worth measuring.
- GB10-tuned ignore list — saricles-style targeted retention of small / bandwidth-critical matrices in BF16. Less impactful for dense Llama than for MoE, but still a candidate optimization.
- AWQ-4bit companion — if anyone publishes (or asks for) an AWQ variant of Anubis-Pro-105B for comparison, having both formats in one author space helps the community calibrate "is NVFP4 worth it for me?" decisions.
## Acknowledgments
- TheDrummer (BeaverAI) for Anubis-Pro-105B-v1, the base model this is built from
- Meta for Llama 3.3, the original foundation model
- NVIDIA for the DGX Spark / GB10 platform, the NVFP4 format, and modelopt
- vLLM project for compressed-tensors NVFP4 inference support
- Avarok-Cybersecurity (`tbraun96`) for the MARLIN-backend port of NVFP4 GEMM and the `avarok/dgx-vllm-nvfp4-kernel` Docker image that made NVFP4 actually competitive on Spark — see their blog post. Use their runtime stack to get peak speed from this model.
- saricles for setting the state-of-the-art bar on GB10-tuned NVFP4 quants — their `MiniMax-M2.5-REAP-…-NVFP4-GB10`, `Qwen3-Coder-Next-NVFP4-GB10`, and the documented `quantize-nvfp4-gb10-agentic.py` recipe are the reference for what a GB10-specific quantization recipe looks like. This release uses the modelopt default and is not GB10-tuned in that sense; a future v2 might be.
- RedHatAI for `Qwen3.5-122B-A10B-NVFP4` and similar llmcompressor-based releases — the closest size analog to this model.
- lukealonso for active NVFP4 publishing.
- mradermacher and bartowski for setting community precedent on Anubis-Pro requants in other formats (GGUF).
## License & attribution
This model is a derivative of Llama 3.3 70B (via TheDrummer/Anubis-Pro-105B-v1) and is therefore distributed under the Llama 3.3 Community License Agreement.
- Full license text: see `LICENSE` in this repository
- Required attribution: see `NOTICE`
- Acceptable Use Policy: https://www.llama.com/llama3_3/use-policy
By downloading, using, or redistributing this model you agree to the terms of the Llama 3.3 Community License Agreement.
Llama 3.3 is licensed under the Llama 3.3 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.
Built with Llama.