# ALIA-40b-instruct-2601 – NVFP4 (compressed-tensors, vLLM)
NVFP4 quantization of BSC-LT/ALIA-40b-instruct-2601 packaged in compressed-tensors format for vLLM. 27 GB on disk, ~9 tok/s steady-state generation on a single NVIDIA GB10 (DGX Spark) with native NVFP4 hardware kernels via FlashInfer + CUTLASS.
For the GGUF / llama.cpp / Ollama version of this model, see montevive/ALIA-40b-instruct-2601-NVFP4-GGUF.
| File | Format | Size | Use case |
|---|---|---|---|
| `model-0000{1..6}-of-00006.safetensors` | NVFP4 (compressed-tensors / `nvfp4-pack-quantized`) | 26.8 GB | Blackwell GPUs (RTX 50xx, GB10, B-series) via vLLM or TRT-LLM |
## What's in this repo
- The NVFP4 safetensors shards, produced by llmcompressor 0.10 (NVFP4 calibration on 512 samples of `HuggingFaceH4/ultrachat_200k` at sequence length 2048).
- `config.json` with `quantization_config: {format: "nvfp4-pack-quantized", quant_method: "compressed-tensors"}` – vLLM auto-detects NVFP4 from this and selects the `FlashInferCutlassNvFp4LinearKernel` (see the snippet after this list).
- `chat_template.jinja` – ALIA's training-format chat template (folds the system prompt into the first user turn; see "Chat template" below).
- `recipe.yaml` – the llmcompressor recipe used to produce the quant.
- Tokenizer files – sentencepiece plus the GPT2-style BPE merges from the source model.
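If you want to verify what vLLM keys off of, a quick sanity check with `transformers` (illustrative only; `AutoConfig` simply reads `config.json` from the Hub):

```python
# Illustrative check: confirm the quantization_config that tells vLLM to pick the NVFP4 kernel.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("montevive/ALIA-40b-instruct-2601-NVFP4")
print(cfg.quantization_config["format"])        # expected: "nvfp4-pack-quantized"
print(cfg.quantization_config["quant_method"])  # expected: "compressed-tensors"
```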
## Recommended runtime: vLLM

```bash
# Install vLLM with NVFP4 support (requires CUDA 13 toolkit on the host for flashinfer JIT)
pip install vllm  # 0.20+

# DGX Spark / Blackwell: ensure CUDA 13 nvcc is in PATH (otherwise flashinfer JIT fails for sm_120)
export CUDA_HOME=/usr/local/cuda-13.0
export PATH=/usr/local/cuda-13.0/bin:$PATH

# Serve via OpenAI-compatible API
vllm serve montevive/ALIA-40b-instruct-2601-NVFP4 \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.55 \
  --served-model-name alia-nvfp4
```
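Once the server is up, any OpenAI-compatible client works. A minimal sketch with the `openai` Python package, assuming the default port 8000 and the `--served-model-name` above:

```python
# Query the vLLM OpenAI-compatible endpoint (assumes `vllm serve` is running locally on port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM does not check the key by default
resp = client.chat.completions.create(
    model="alia-nvfp4",  # must match --served-model-name
    messages=[{"role": "user", "content": "La capital de España es"}],
    temperature=0.1,     # low temperature, per the sampling guidance below
    top_p=0.9,
    max_tokens=200,
)
print(resp.choices[0].message.content)
```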
Or offline:
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="montevive/ALIA-40b-instruct-2601-NVFP4",
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.55,
)

sp = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=200)
out = llm.chat(
    [[{"role": "user", "content": "La capital de España es"}]],
    sp,
)
print(out[0].outputs[0].text)
```
Recommended sampling (per the base model card): temperature between 0 and 0.2; avoid repetition penalties, as they degrade instruction-following on this model.
## vLLM serving notes for DGX Spark (GB10, unified memory)

- The Spark exposes 121 GB of unified memory (no separate VRAM); `gpu_memory_utilization=0.55` reserves ~66 GB for vLLM and leaves the rest for the OS and page cache.
- After loading the model, the Linux page cache holds the safetensors shards (~27 GB). On subsequent vLLM launches, drop the cache with `sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"` if vLLM complains "Free memory on device cuda:0 (X/121 GiB) on startup is less than desired GPU memory utilization." (A quick pre-flight check is sketched after this list.)
- First launch JIT-compiles the sm_120 NVFP4 CUTLASS kernel (3-5 min) and torch.compiles the model graph (~20 s). Both are cached under `~/.cache/flashinfer/` and `~/.cache/vllm/torch_compile_cache/`, so subsequent launches drop to ~1 min.
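If you want to script the cache-dropping decision, a tiny pre-flight check (illustrative only; the 0.55 factor mirrors `--gpu-memory-utilization` above):

```python
# Illustrative pre-flight check on the Spark: is enough unified memory actually free
# for vLLM to claim 0.55 * total, or should the page cache be dropped first?
def meminfo_gib(field: str) -> float:
    with open("/proc/meminfo") as f:          # values are reported in kB
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1]) / 1024**2
    raise KeyError(field)

total, available = meminfo_gib("MemTotal"), meminfo_gib("MemAvailable")
target = 0.55 * total                          # ~66 GiB on a 121 GiB GB10
print(f"available {available:.0f} GiB, vLLM wants ~{target:.0f} GiB")
if available < target:
    print('consider: sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"')
```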
## Performance (DGX Spark, GB10 Blackwell, NVFP4)

Measured with vLLM 0.20.0 + torch 2.11.0+cu130 + flashinfer 0.6.8.post1, single-prompt smoke tests (no batching):
| Prompt | Tokens generated | Time | Throughput |
|---|---|---|---|
| Spanish (warmup) | 162 | 40 s | 4.0 tok/s |
| Catalan | 200 | 23 s | 8.7 tok/s |
| Basque (system override) | 200 | 23 s | 8.8 tok/s |
Steady state is ~8.7 tok/s; the first run is dominated by CUDA graph capture and kernel autotuning and is not representative. With continuous batching across multiple users, vLLM scales considerably higher.
For comparison: the GGUF version runs at ~10.2 tok/s under `llama.cpp --jinja` on the same hardware. The slight gap is single-prompt overhead in vLLM; for multi-tenant serving, vLLM is typically faster.
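To reproduce a single-prompt number yourself, a rough sketch (the figures above were read from vLLM's own request logs, not from this exact script):

```python
# Rough single-prompt throughput check; warm up once so cudagraph capture / kernel
# autotuning doesn't pollute the measurement.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="montevive/ALIA-40b-instruct-2601-NVFP4",
    max_model_len=4096,
    gpu_memory_utilization=0.55,
)
sp = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=200)
prompt = [{"role": "user", "content": "Explica breument què és la computació d'altes prestacions."}]

llm.chat([prompt], sp)                         # warmup run (not timed)
t0 = time.perf_counter()
out = llm.chat([prompt], sp)
dt = time.perf_counter() - t0
n_tokens = len(out[0].outputs[0].token_ids)
print(f"{n_tokens} tokens in {dt:.1f} s -> {n_tokens / dt:.1f} tok/s")
```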
## Chat template

ALIA was trained with a non-standard ChatML variant: the system message is folded into the first user turn, separated by `\n\n`, instead of being emitted as its own `<|im_start|>system ... <|im_end|>` block (see BSC's PR #1 on the official GGUF).

The `chat_template.jinja` shipped in this repo encodes this exactly. vLLM (and `transformers.AutoTokenizer.apply_chat_template`) honor it automatically. For runtimes that ignore embedded Jinja templates (notably Ollama), see the GGUF repo's Modelfile for a Go-template equivalent.
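To see the folding behaviour for yourself, render the template with `transformers` (a quick sketch; the exact markup is whatever `chat_template.jinja` emits):

```python
# Render the shipped chat template: the system message should be prepended to the
# first user turn rather than emitted as its own <|im_start|>system block.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("montevive/ALIA-40b-instruct-2601-NVFP4")
messages = [
    {"role": "system", "content": "Ets un assistent que respon sempre en català."},
    {"role": "user", "content": "Quina és la capital d'Espanya?"},
]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```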
## Compatibility matrix

| Runtime | Native NVFP4 on Blackwell | Notes |
|---|---|---|
| vLLM (recommended) | Yes (FlashInfer + CUTLASS NVFP4 GEMM) | Production-ready. CUDA 13 nvcc required on the host for first-run JIT. |
| TensorRT-LLM | Yes | Same compressed-tensors format. Heavier setup. |
| HuggingFace transformers | No (dequantizes to BF16 at load time) | Loads fine via the compressed-tensors library, but resident size jumps to ~76 GB and inference runs at BF16 speed. Useful for compatibility/eval, not production (sketch below). |
| llama.cpp | Yes (different format) | Use the GGUF repo; compressed-tensors is not GGUF. |
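For a quick compatibility/eval run through `transformers` (remember this dequantizes to BF16, so expect ~76 GB resident and BF16-speed inference; the `compressed-tensors` package must be installed), a minimal sketch:

```python
# Compatibility check only: transformers dequantizes the NVFP4 weights to BF16 at load time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "montevive/ALIA-40b-instruct-2601-NVFP4"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tok.apply_chat_template(
    [{"role": "user", "content": "La capital de España es"}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
print(tok.decode(model.generate(inputs, max_new_tokens=50)[0], skip_special_tokens=True))
```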
## Quantization details

- Tool: llmcompressor 0.10.0.1 + compressed-tensors 0.14.0.1 (a reproduction sketch follows this list)
- Scheme: NVFP4 (`nvfp4-pack-quantized`)
- Weights: 4-bit float, group_size=16, symmetric, scale_dtype=float8_e4m3fn
- Input activations: 4-bit float, dynamic local, group_size=16, symmetric, scale_dtype=float8_e4m3fn
- Ignored layers: `lm_head`
- Calibration dataset: `HuggingFaceH4/ultrachat_200k`, split `train_sft[:512]`, max_seq_len 2048
- Calibration runtime: ~1.5 h on dual RTX 3090 Ti (sequential per-layer offload)
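For reference, a sketch of how such a quant can be reproduced with llmcompressor, roughly mirroring the shipped `recipe.yaml` (scheme name, dataset handling, and the save call follow llmcompressor's NVFP4 example; treat exact argument names as assumptions against your installed version):

```python
# Sketch of the NVFP4 oneshot quantization (roughly mirrors recipe.yaml; verify against
# the llmcompressor version you have installed).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "BSC-LT/ALIA-40b-instruct-2601"
NUM_SAMPLES, MAX_LEN = 512, 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: 512 ultrachat_200k conversations rendered through the chat template.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=MAX_LEN, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# NVFP4 for all Linear layers except lm_head (weights + dynamic input activations).
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

model.save_pretrained("ALIA-40b-instruct-2601-NVFP4", save_compressed=True)
tokenizer.save_pretrained("ALIA-40b-instruct-2601-NVFP4")
```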
## Calibration caveat

Calibration used ultrachat_200k, an English-only synthetic conversation dataset. ALIA is BSC's Iberian-multilingual model, so this is suboptimal: the per-tensor and per-block scales were computed against an activation distribution that doesn't fully represent the model's actual use case. Re-quantizing with a multilingual instruction dataset (Iberian languages + code) would likely improve quality on Spanish/Catalan/Basque/Galician outputs. PRs welcome.
## License & attribution
This quantized model is released under the same Apache 2.0 license as the source.
Base model: BSC-LT/ALIA-40b-instruct-2601 by Barcelona Supercomputing Center (BSC). Please cite their work if you use this model in research:
```bibtex
@misc{alia-40b-instruct,
  author = {Barcelona Supercomputing Center},
  title  = {ALIA-40b-instruct},
  year   = {2026},
  url    = {https://huggingface.co/BSC-LT/ALIA-40b-instruct-2601}
}
```

NVFP4 quantization: Montevive AI.
## Limitations
Inherits all limitations of the base ALIA-40b-instruct-2601 model. Additionally:
- NVFP4 inference quality is below BF16/FP8. Empirically it holds up well at 40B, but evaluate on your own task before deploying.
- NVFP4 + Blackwell + vLLM is a recent combination; expect API churn and occasional kernel issues. Tested against vLLM 0.20.0 / torch 2.11+cu130 / flashinfer 0.6.8.post1.
- `transformers` users will fall back to BF16 (no native NVFP4 in stock `transformers` as of May 2026). Use vLLM or TRT-LLM for the speed/memory benefit.
- Calibration was English-only (see above); multilingual tasks may benefit from re-quantization with Iberian-language calibration data.