ALIA-40b-instruct-2601 – NVFP4 (compressed-tensors, vLLM)

NVFP4 quantization of BSC-LT/ALIA-40b-instruct-2601 packaged in compressed-tensors format for vLLM. 27 GB on disk, ~9 tok/s steady-state generation on a single NVIDIA GB10 (DGX Spark) with native NVFP4 hardware kernels via FlashInfer + CUTLASS.

For the GGUF / llama.cpp / Ollama version of this model, see montevive/ALIA-40b-instruct-2601-NVFP4-GGUF.

| File | Format | Size | Use case |
|---|---|---|---|
| model-0000{1..6}-of-00006.safetensors | NVFP4 (compressed-tensors / nvfp4-pack-quantized) | 26.8 GB | Blackwell GPUs (RTX 50xx, GB10, B-series) via vLLM or TRT-LLM |

What's in this repo

  • The NVFP4 safetensors shards – produced by llmcompressor 0.10 (NVFP4 calibration on 512 samples of HuggingFaceH4/ultrachat_200k at sequence length 2048).
  • config.json with quantization_config: {format: "nvfp4-pack-quantized", quant_method: "compressed-tensors"} – vLLM auto-detects NVFP4 from this and selects the FlashInferCutlassNvFp4LinearKernel (see the snippet after this list).
  • chat_template.jinja – ALIA's training-format chat template (folds the system prompt into the first user turn; see "Chat template" below).
  • recipe.yaml – the llmcompressor recipe used to produce the quant.
  • Tokenizer files – sentencepiece + the GPT2-style BPE merges from the source model.
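
If you want to confirm what vLLM will see before downloading the full shards, you can inspect the quantization block of config.json on its own. A minimal sketch using huggingface_hub (only the two keys quoted above are assumed; hf_hub_download fetches just the config file):

import json
from huggingface_hub import hf_hub_download

# Fetch only config.json (a few KB), not the 27 GB of weight shards.
cfg_path = hf_hub_download("montevive/ALIA-40b-instruct-2601-NVFP4", "config.json")
with open(cfg_path) as f:
    cfg = json.load(f)

# vLLM keys off these two fields to pick its NVFP4 kernel path.
qc = cfg["quantization_config"]
print(qc["quant_method"])  # expected: "compressed-tensors"
print(qc["format"])        # expected: "nvfp4-pack-quantized"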

Recommended runtime: vLLM

# Install vLLM with NVFP4 support (requires CUDA 13 toolkit on the host for flashinfer JIT)
pip install vllm   # 0.20+

# DGX Spark / Blackwell: ensure CUDA 13 nvcc is in PATH (otherwise flashinfer JIT fails for sm_120)
export CUDA_HOME=/usr/local/cuda-13.0
export PATH=/usr/local/cuda-13.0/bin:$PATH

# Serve via OpenAI-compatible API
vllm serve montevive/ALIA-40b-instruct-2601-NVFP4 \
    --dtype bfloat16 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.55 \
    --served-model-name alia-nvfp4
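
Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch with the openai Python package (assumes vLLM's default port 8000 and the --served-model-name from the command above):

from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the API key is unused but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="alia-nvfp4",  # must match --served-model-name
    messages=[{"role": "user", "content": "La capital de España es"}],
    temperature=0.1,
    top_p=0.9,
    max_tokens=200,
)
print(resp.choices[0].message.content)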

Or offline:

from vllm import LLM, SamplingParams

llm = LLM(
    model="montevive/ALIA-40b-instruct-2601-NVFP4",
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.55,
)
sp = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=200)
out = llm.chat(
    [[{"role": "user", "content": "La capital de España es"}]],
    sp,
)
print(out[0].outputs[0].text)

Recommended sampling (per the base model card): temperature between 0 and 0.2; avoid repetition penalties – they degrade instruction-following on this model.

vLLM serving notes for DGX Spark (GB10, unified memory)

  • The Spark exposes 121 GB unified memory (no separate VRAM); gpu_memory_utilization=0.55 reserves ~66 GB for vLLM and leaves the rest for OS/page-cache.
  • After loading the model, the Linux page cache holds the safetensors shards (~27 GB). If a subsequent vLLM launch complains "Free memory on device cuda:0 (X/121 GiB) on startup is less than desired GPU memory utilization", drop the cache first with sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" (a quick pre-flight check is sketched after this list).
  • The first launch JIT-compiles the sm_120 NVFP4 CUTLASS kernel (3-5 min) and runs torch.compile on the model graph (~20 s). Both are cached under ~/.cache/flashinfer/ and ~/.cache/vllm/torch_compile_cache/, so subsequent launches drop to ~1 min.
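
As a quick pre-flight check before launching, you can print how much free memory vLLM will actually see (a sketch; it assumes the GB10 unified pool is reported as CUDA device 0 and reuses the 0.55 utilization figure from above):

import torch

# (free, total) in bytes for device 0; vLLM compares free memory on startup
# against gpu_memory_utilization * total, per the error message quoted above.
free, total = torch.cuda.mem_get_info(0)
print(f"free: {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB")
print(f"vLLM will try to reserve {0.55 * total / 2**30:.1f} GiB (gpu_memory_utilization=0.55)")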

Performance (DGX Spark, GB10 Blackwell, NVFP4)

Measured with vLLM 0.20.0 + torch 2.11.0+cu130 + flashinfer 0.6.8.post1, single-prompt smoke test (no batching):

| Prompt | Tokens generated | Time | Throughput |
|---|---|---|---|
| Spanish (warmup) | 162 | 40 s | 4.0 tok/s |
| Catalan | 200 | 23 s | 8.7 tok/s |
| Basque (system override) | 200 | 23 s | 8.8 tok/s |

Steady-state throughput is ~8.7 tok/s; the first run is dominated by cudagraph capture / kernel autotune and is not representative. With continuous batching across multiple users, vLLM scales considerably higher.

For comparison: the GGUF version runs at ~10.2 tok/s under llama.cpp --jinja on the same hardware. The slight gap is single-prompt overhead in vLLM; for multi-tenant serving vLLM is typically faster.
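
The numbers above are simple wall-clock timings of single prompts. To reproduce a comparable measurement on your own setup, a rough sketch with the offline API (the prompt is arbitrary, and the warmup call is there to exclude cudagraph capture / autotune, as noted above):

import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="montevive/ALIA-40b-instruct-2601-NVFP4",
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.55,
)
sp = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=200)
msgs = [[{"role": "user", "content": "Explica breument què és la computació quàntica."}]]

llm.chat(msgs, sp)  # warmup: the first run pays for cudagraph capture / kernel autotune

t0 = time.perf_counter()
out = llm.chat(msgs, sp)
dt = time.perf_counter() - t0
n_tok = len(out[0].outputs[0].token_ids)
print(f"{n_tok} tokens in {dt:.1f} s -> {n_tok / dt:.1f} tok/s")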

Chat template

ALIA was trained with a non-standard ChatML variant: the system message is folded into the first user turn, separated by \n\n, instead of being emitted as its own <|im_start|>system ... <|im_end|> block (see BSC's PR #1 on the official GGUF).

The chat_template.jinja shipped in this repo encodes this exactly. vLLM (and transformers.AutoTokenizer.apply_chat_template) honor it automatically. For runtimes that ignore embedded Jinja templates (notably Ollama), see the GGUF repo's Modelfile for a Go-template equivalent.
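
To see the folding concretely, render a system + user conversation through the shipped template (a sketch; the exact rendered string depends on the template in this repo, so treat the comment at the end as illustrative):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("montevive/ALIA-40b-instruct-2601-NVFP4")
messages = [
    {"role": "system", "content": "Respon sempre en català."},
    {"role": "user", "content": "Quina és la capital d'Espanya?"},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Expect a single <|im_start|>user block whose content starts with the system text,
# then "\n\n", then the user question -- no separate <|im_start|>system block.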

Compatibility matrix

| Runtime | Native NVFP4 on Blackwell | Notes |
|---|---|---|
| vLLM (recommended) | ✅ Yes (FlashInfer + CUTLASS NVFP4 GEMM) | Production-ready. CUDA 13 nvcc required on the host for first-run JIT. |
| TensorRT-LLM | ✅ Yes | Same compressed-tensors format. Heavier setup. |
| HuggingFace transformers | ❌ No (dequantizes to BF16 at load time) | Loads fine via the compressed-tensors library, but resident size jumps to ~76 GB and inference runs at BF16 speed. Useful for compatibility/eval, not production (see the sketch below). |
| llama.cpp | ✅ Yes (different format) | Use the GGUF repo – compressed-tensors is not GGUF. |
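
For completeness, the transformers fallback from the table looks like an ordinary from_pretrained call (a sketch; it assumes the compressed-tensors package is installed and that recent transformers dequantizes this checkpoint to BF16 at load time, as noted above, so budget ~76 GB of memory):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# No NVFP4 kernels on this path: weights are decompressed to BF16 while loading,
# so this is only useful for compatibility checks and evals, not serving.
model = AutoModelForCausalLM.from_pretrained(
    "montevive/ALIA-40b-instruct-2601-NVFP4",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("montevive/ALIA-40b-instruct-2601-NVFP4")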

Quantization details

  • Tool: llmcompressor 0.10.0.1 + compressed-tensors 0.14.0.1 (see the recipe sketch after this list)
  • Scheme: NVFP4 (nvfp4-pack-quantized)
  • Weights: 4-bit float, group_size=16, symmetric, scale_dtype=float8_e4m3fn
  • Input activations: 4-bit float, dynamic local, group_size=16, symmetric, scale_dtype=float8_e4m3fn
  • Ignored layers: lm_head
  • Calibration dataset: HuggingFaceH4/ultrachat_200k, split train_sft[:512], max_seq_len 2048
  • Calibration runtime: ~1.5 h on dual RTX 3090 Ti (sequential per-layer offload)
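
For reference, recipe.yaml plus the settings above correspond roughly to the following llmcompressor call. This is a sketch modeled on llm-compressor's published NVFP4 / ultrachat examples, not the exact script used for this repo; the dataset preprocessing and output paths in particular are illustrative:

from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "BSC-LT/ALIA-40b-instruct-2601"
NUM_SAMPLES = 512
MAX_SEQ_LEN = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 512 calibration conversations, rendered through the chat template and tokenized.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=MAX_SEQ_LEN, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# NVFP4 weights + activations on every Linear layer except lm_head (matches the list above).
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

model.save_pretrained("ALIA-40b-instruct-2601-NVFP4", save_compressed=True)
tokenizer.save_pretrained("ALIA-40b-instruct-2601-NVFP4")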

Calibration caveat

Calibration used ultrachat_200k – an English-only synthetic conversation dataset. ALIA is BSC's Iberian-multilingual model, so this is suboptimal: the per-tensor and per-block scales were computed against an activation distribution that doesn't fully represent the model's actual use case. Re-quantizing with a multilingual instruction dataset (Iberian languages + code) would likely improve quality on Spanish/Catalan/Basque/Galician outputs. PRs welcome.

License & attribution

This quantized model is released under the same Apache 2.0 license as the source.

  • Base model: BSC-LT/ALIA-40b-instruct-2601 by Barcelona Supercomputing Center (BSC). Please cite their work if you use this model in research:

    @misc{alia-40b-instruct,
      author = {Barcelona Supercomputing Center},
      title  = {ALIA-40b-instruct},
      year   = {2026},
      url    = {https://huggingface.co/BSC-LT/ALIA-40b-instruct-2601}
    }
    
  • NVFP4 quantization: Montevive AI.

Limitations

Inherits all limitations of the base ALIA-40b-instruct-2601 model. Additionally:

  • NVFP4 inference quality is below BF16/FP8. It holds up well empirically at 40B, but you should evaluate on your own task before deploying.
  • NVFP4 + Blackwell + vLLM is a recent stack – expect API churn and occasional kernel issues. Tested against vLLM 0.20.0 / torch 2.11+cu130 / flashinfer 0.6.8.post1.
  • transformers users will fall back to BF16 (no native NVFP4 in stock transformers as of May 2026). Use vLLM or TRT-LLM for the speed/memory benefit.
  • Calibration was English-only (see above) – multilingual tasks may benefit from re-quantization with Iberian-language calibration data.