# ALIA-40b-instruct-2601 – NVFP4 (compressed-tensors, vLLM)
NVFP4 quantization of BSC-LT/ALIA-40b-instruct-2601 packaged in compressed-tensors format for vLLM. 27 GB on disk, ~9 tok/s steady-state generation on a single NVIDIA GB10 (DGX Spark) with native NVFP4 hardware kernels via FlashInfer + CUTLASS.
For the GGUF / llama.cpp / Ollama version of this model, see montevive/ALIA-40b-instruct-2601-NVFP4-GGUF.
| File | Format | Size | Use case |
|---|---|---|---|
| `model-0000{1..6}-of-00006.safetensors` | NVFP4 (compressed-tensors / `nvfp4-pack-quantized`) | 26.8 GB | Blackwell GPUs (RTX 50xx, GB10, B-series) via vLLM or TRT-LLM |
## What's in this repo
- The NVFP4 safetensors shards, produced by llmcompressor 0.10 (NVFP4 calibration on 512 samples of `HuggingFaceH4/ultrachat_200k` at sequence length 2048).
- `config.json` with `quantization_config: {format: "nvfp4-pack-quantized", quant_method: "compressed-tensors"}` – vLLM auto-detects NVFP4 from this and selects the `FlashInferCutlassNvFp4LinearKernel` (see the snippet after this list).
- `chat_template.jinja` – ALIA's training-format chat template (folds the system prompt into the first user turn; see "Chat template" below).
- `recipe.yaml` – the llmcompressor recipe used to produce the quant.
- Tokenizer files – sentencepiece plus the GPT2-style BPE merges from the source model.
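If you want to verify what vLLM keys off of, a quick sanity check with `transformers` (illustrative only; `AutoConfig` simply reads `config.json` from the Hub):

```python
# Illustrative check: confirm the quantization_config that tells vLLM to pick the NVFP4 kernel.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("montevive/ALIA-40b-instruct-2601-NVFP4")
print(cfg.quantization_config["format"])        # expected: "nvfp4-pack-quantized"
print(cfg.quantization_config["quant_method"])  # expected: "compressed-tensors"
```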
## Recommended runtime: vLLM

```bash
# Install vLLM with NVFP4 support (requires CUDA 13 toolkit on the host for flashinfer JIT)
pip install vllm  # 0.20+

# DGX Spark / Blackwell: ensure CUDA 13 nvcc is in PATH (otherwise flashinfer JIT fails for sm_120)
export CUDA_HOME=/usr/local/cuda-13.0
export PATH=/usr/local/cuda-13.0/bin:$PATH

# Serve via OpenAI-compatible API
vllm serve montevive/ALIA-40b-instruct-2601-NVFP4 \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.55 \
  --served-model-name alia-nvfp4
```
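Once the server is up, any OpenAI-compatible client works. A minimal sketch with the `openai` Python package, assuming the default port 8000 and the `--served-model-name` above:

```python
# Query the vLLM OpenAI-compatible endpoint (assumes `vllm serve` is running locally on port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM does not check the key by default
resp = client.chat.completions.create(
    model="alia-nvfp4",  # must match --served-model-name
    messages=[{"role": "user", "content": "La capital de España es"}],
    temperature=0.1,     # low temperature, per the sampling guidance below
    top_p=0.9,
    max_tokens=200,
)
print(resp.choices[0].message.content)
```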
Or offline:
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="montevive/ALIA-40b-instruct-2601-NVFP4",
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.55,
)

sp = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=200)
out = llm.chat(
    [[{"role": "user", "content": "La capital de España es"}]],
    sp,
)
print(out[0].outputs[0].text)
```
Recommended sampling (per the base model card): temperature between 0 and 0.2; avoid repetition penalties, as they degrade instruction-following on this model.
## vLLM serving notes for DGX Spark (GB10, unified memory)

- The Spark exposes 121 GB of unified memory (no separate VRAM); `gpu_memory_utilization=0.55` reserves ~66 GB for vLLM and leaves the rest for the OS and page cache.
- After loading the model, the Linux page cache holds the safetensors shards (~27 GB). On subsequent vLLM launches, drop the cache with `sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"` if vLLM complains "Free memory on device cuda:0 (X/121 GiB) on startup is less than desired GPU memory utilization." (A quick pre-flight check is sketched after this list.)
- First launch JIT-compiles the sm_120 NVFP4 CUTLASS kernel (3-5 min) and torch.compiles the model graph (~20 s). Both are cached under `~/.cache/flashinfer/` and `~/.cache/vllm/torch_compile_cache/`, so subsequent launches drop to ~1 min.
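If you want to script the cache-dropping decision, a tiny pre-flight check (illustrative only; the 0.55 factor mirrors `--gpu-memory-utilization` above):

```python
# Illustrative pre-flight check on the Spark: is enough unified memory actually free
# for vLLM to claim 0.55 * total, or should the page cache be dropped first?
def meminfo_gib(field: str) -> float:
    with open("/proc/meminfo") as f:          # values are reported in kB
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1]) / 1024**2
    raise KeyError(field)

total, available = meminfo_gib("MemTotal"), meminfo_gib("MemAvailable")
target = 0.55 * total                          # ~66 GiB on a 121 GiB GB10
print(f"available {available:.0f} GiB, vLLM wants ~{target:.0f} GiB")
if available < target:
    print('consider: sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"')
```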
## Performance (DGX Spark, GB10 Blackwell, NVFP4)

Measured with vLLM 0.20.0 + torch 2.11.0+cu130 + flashinfer 0.6.8.post1, single-prompt smoke tests (no batching):
| Prompt | Tokens generated | Time | Throughput |
|---|---|---|---|
| Spanish (warmup) | 162 | 40 s | 4.0 tok/s |
| Catalan | 200 | 23 s | 8.7 tok/s |
| Basque (system override) | 200 | 23 s | 8.8 tok/s |
Steady state is ~8.7 tok/s; the first run is dominated by CUDA graph capture and kernel autotuning and is not representative. With continuous batching across multiple users, vLLM scales considerably higher.
For comparison: the GGUF version runs at ~10.2 tok/s under `llama.cpp --jinja` on the same hardware. The slight gap is single-prompt overhead in vLLM; for multi-tenant serving, vLLM is typically faster.
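To reproduce a single-prompt number yourself, a rough sketch (the figures above were read from vLLM's own request logs, not from this exact script):

```python
# Rough single-prompt throughput check; warm up once so cudagraph capture / kernel
# autotuning doesn't pollute the measurement.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="montevive/ALIA-40b-instruct-2601-NVFP4",
    max_model_len=4096,
    gpu_memory_utilization=0.55,
)
sp = SamplingParams(temperature=0.1, top_p=0.9, max_tokens=200)
prompt = [{"role": "user", "content": "Explica breument què és la computació d'altes prestacions."}]

llm.chat([prompt], sp)                         # warmup run (not timed)
t0 = time.perf_counter()
out = llm.chat([prompt], sp)
dt = time.perf_counter() - t0
n_tokens = len(out[0].outputs[0].token_ids)
print(f"{n_tokens} tokens in {dt:.1f} s -> {n_tokens / dt:.1f} tok/s")
```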
## Chat template

ALIA was trained with a non-standard ChatML variant: the system message is folded into the first user turn, separated by `\n\n`, instead of being emitted as its own `<|im_start|>system ... <|im_end|>` block (see BSC's PR #1 on the official GGUF).

The `chat_template.jinja` shipped in this repo encodes this exactly. vLLM (and `transformers.AutoTokenizer.apply_chat_template`) honor it automatically. For runtimes that ignore embedded Jinja templates (notably Ollama), see the GGUF repo's Modelfile for a Go-template equivalent.
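To see the folding behaviour for yourself, render the template with `transformers` (a quick sketch; the exact markup is whatever `chat_template.jinja` emits):

```python
# Render the shipped chat template: the system message should be prepended to the
# first user turn rather than emitted as its own <|im_start|>system block.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("montevive/ALIA-40b-instruct-2601-NVFP4")
messages = [
    {"role": "system", "content": "Ets un assistent que respon sempre en català."},
    {"role": "user", "content": "Quina és la capital d'Espanya?"},
]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```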
## Compatibility matrix

| Runtime | Native NVFP4 on Blackwell | Notes |
|---|---|---|
| vLLM (recommended) | Yes (FlashInfer + CUTLASS NVFP4 GEMM) | Production-ready. CUDA 13 nvcc required on the host for first-run JIT. |
| TensorRT-LLM | Yes | Same compressed-tensors format. Heavier setup. |
| HuggingFace transformers | No (dequantizes to BF16 at load time) | Loads fine via the compressed-tensors library, but resident size jumps to ~76 GB and inference runs at BF16 speed. Useful for compatibility/eval, not production (sketch below). |
| llama.cpp | Yes (different format) | Use the GGUF repo; compressed-tensors is not GGUF. |
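For a quick compatibility/eval run through `transformers` (remember this dequantizes to BF16, so expect ~76 GB resident and BF16-speed inference; the `compressed-tensors` package must be installed), a minimal sketch:

```python
# Compatibility check only: transformers dequantizes the NVFP4 weights to BF16 at load time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "montevive/ALIA-40b-instruct-2601-NVFP4"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tok.apply_chat_template(
    [{"role": "user", "content": "La capital de España es"}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
print(tok.decode(model.generate(inputs, max_new_tokens=50)[0], skip_special_tokens=True))
```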
## Quantization details

- Tool: llmcompressor 0.10.0.1 + compressed-tensors 0.14.0.1 (a reproduction sketch follows this list)
- Scheme: NVFP4 (`nvfp4-pack-quantized`)
- Weights: 4-bit float, group_size=16, symmetric, scale_dtype=float8_e4m3fn
- Input activations: 4-bit float, dynamic local, group_size=16, symmetric, scale_dtype=float8_e4m3fn
- Ignored layers: `lm_head`
- Calibration dataset: `HuggingFaceH4/ultrachat_200k`, split `train_sft[:512]`, max_seq_len 2048
- Calibration runtime: ~1.5 h on dual RTX 3090 Ti (sequential per-layer offload)
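For reference, a sketch of how such a quant can be reproduced with llmcompressor, roughly mirroring the shipped `recipe.yaml` (scheme name, dataset handling, and the save call follow llmcompressor's NVFP4 example; treat exact argument names as assumptions against your installed version):

```python
# Sketch of the NVFP4 oneshot quantization (roughly mirrors recipe.yaml; verify against
# the llmcompressor version you have installed).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "BSC-LT/ALIA-40b-instruct-2601"
NUM_SAMPLES, MAX_LEN = 512, 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: 512 ultrachat_200k conversations rendered through the chat template.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=MAX_LEN, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# NVFP4 for all Linear layers except lm_head (weights + dynamic input activations).
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

model.save_pretrained("ALIA-40b-instruct-2601-NVFP4", save_compressed=True)
tokenizer.save_pretrained("ALIA-40b-instruct-2601-NVFP4")
```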
## Calibration caveat

Calibration used ultrachat_200k, an English-only synthetic conversation dataset. ALIA is BSC's Iberian-multilingual model, so this is suboptimal: the per-tensor and per-block scales were computed against an activation distribution that doesn't fully represent the model's actual use case. Re-quantizing with a multilingual instruction dataset (Iberian languages + code) would likely improve quality on Spanish/Catalan/Basque/Galician outputs. PRs welcome.
## License & attribution
This quantized model is released under the same Apache 2.0 license as the source.
Base model: BSC-LT/ALIA-40b-instruct-2601 by Barcelona Supercomputing Center (BSC). Please cite their work if you use this model in research:
```bibtex
@misc{alia-40b-instruct,
  author = {Barcelona Supercomputing Center},
  title  = {ALIA-40b-instruct},
  year   = {2026},
  url    = {https://huggingface.co/BSC-LT/ALIA-40b-instruct-2601}
}
```

NVFP4 quantization: Montevive AI.
## Limitations
Inherits all limitations of the base ALIA-40b-instruct-2601 model. Additionally:
- NVFP4 inference quality is below BF16/FP8. Empirically it holds up well at 40B, but evaluate on your own task before deploying.
- NVFP4 + Blackwell + vLLM is a recent combination; expect API churn and occasional kernel issues. Tested against vLLM 0.20.0 / torch 2.11+cu130 / flashinfer 0.6.8.post1.
- `transformers` users will fall back to BF16 (no native NVFP4 in stock `transformers` as of May 2026). Use vLLM or TRT-LLM for the speed/memory benefit.
- Calibration was English-only (see above); multilingual tasks may benefit from re-quantization with Iberian-language calibration data.