Apriel 1.6 15B Thinker - NVFP4

This is an NVFP4 quantized version of ServiceNow-AI/Apriel-1.6-15b-Thinker, optimized for NVIDIA Blackwell GPUs (RTX 50xx series).

✅ FULLY FUNCTIONAL - This model loads and generates correctly in vLLM 0.15.1!

Note: Hugging Face may incorrectly display this as 9B params due to quantization - this is a 15B-parameter model compressed to 4-bit precision.

Model Details

  • Base Model: Apriel 1.6 15B Thinker (Multimodal reasoning model)
  • Parameters: 15 billion (not 9B - HF auto-detection is confused by quantization)
  • Quantization: NVFP4 (4-bit floating point)
  • Size: 11GB (down from 29GB, 62% reduction)
  • Quantizer: SILVERTHRONE
  • Method: llmcompressor with 128 calibration samples
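
The 11GB figure is consistent with a back-of-envelope estimate. NVFP4 stores 4-bit values plus one FP8 scale per 16-element group (about 4.5 effective bits per parameter); the parameter split below is an illustrative assumption, not read from the actual checkpoint:

```python
# Hypothetical parameter split for illustration; the real split between
# quantized and preserved weights is not published with the checkpoint.
GB = 1e9
quantized_params = 14e9   # assumed: linear layers stored in NVFP4
preserved_params = 1e9    # assumed: vision encoder, embeddings, lm_head in BF16

# NVFP4: 4-bit values + one FP8 scale shared per 16-element group
bits_per_quant_param = 4 + 8 / 16   # = 4.5 effective bits
bits_per_bf16_param = 16

size_gb = (quantized_params * bits_per_quant_param
           + preserved_params * bits_per_bf16_param) / 8 / GB
print(f"~{size_gb:.1f} GB")  # -> ~9.9 GB
```

The remaining gap to the measured 11GB is plausibly per-tensor metadata, global scales, and the simplified split above.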

Quantization Strategy

  • Quantized to NVFP4: All linear layers
  • Preserved at full precision: Vision encoder, embeddings, lm_head, multimodal projector

Calibration:

  • Dataset: open-platypus
  • Samples: 128
  • Sequence length: 512

This selective approach maintains multimodal quality while achieving significant size reduction.

Hardware Requirements

  • GPU: NVIDIA RTX 50xx series (Blackwell) or newer
  • VRAM: 14GB minimum; 16GB recommended (tested on a 16GB card)
  • Framework: vLLM 0.15.1+

Note: This model will NOT work with llama.cpp, transformers library, or GGUF-based tools.

Usage

With vLLM

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Load tokenizer for chat template
tokenizer = AutoTokenizer.from_pretrained("SILVERTHRONE/apriel-1.6-15b-thinker-nvfp4")

# Load model (CRITICAL: set max_model_len=2048 and gpu_memory_utilization=0.90)
model = LLM(
    model="SILVERTHRONE/apriel-1.6-15b-thinker-nvfp4",
    gpu_memory_utilization=0.90,
    max_model_len=2048,  # Required! Model is 11GB, need tight memory settings
    enforce_eager=True,
    max_num_seqs=1
)

# Format message with chat template
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate
sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=200
)

outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text

print(response)
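
The same tight settings apply when serving an OpenAI-compatible endpoint instead of using the offline API. This is a sketch using vLLM's standard CLI flags; port and request shape are the vLLM defaults:

```shell
# Launch an OpenAI-compatible server with the same tight memory settings
vllm serve SILVERTHRONE/apriel-1.6-15b-thinker-nvfp4 \
    --max-model-len 2048 \
    --gpu-memory-utilization 0.90 \
    --enforce-eager \
    --max-num-seqs 1

# Query it (in another terminal)
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "SILVERTHRONE/apriel-1.6-15b-thinker-nvfp4",
         "messages": [{"role": "user", "content": "What is 2+2?"}],
         "max_tokens": 200}'
```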

Chat Template

Apriel uses a custom reasoning format with a [BEGIN FINAL RESPONSE] marker:

<s><|begin_system|>
You are a thoughtful, systematic AI assistant from ServiceNow Language Models (SLAM) lab. 
Analyze each question carefully, present your reasoning step-by-step, then provide the 
final response after the marker [BEGIN FINAL RESPONSE].
<|begin_user|>
What is 2+2?
<|begin_assistant|>
Here are my reasoning steps:
The user asks "What is 2+2?" It's a simple arithmetic question. The answer is 4.
[BEGIN FINAL RESPONSE]
2 + 2 = 4.

The model outputs reasoning steps first, then marks the final answer with [BEGIN FINAL RESPONSE]. Parse responses to extract the answer after this marker.
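
Splitting on the marker is enough for most uses. A minimal helper (the name `extract_final_response` is my own, not part of any Apriel tooling):

```python
def extract_final_response(text: str) -> str:
    """Return the text after the [BEGIN FINAL RESPONSE] marker,
    or the whole output if the marker is absent."""
    marker = "[BEGIN FINAL RESPONSE]"
    _, sep, after = text.partition(marker)
    return after.strip() if sep else text.strip()

raw = (
    'Here are my reasoning steps:\n'
    'The user asks "What is 2+2?" The answer is 4.\n'
    '[BEGIN FINAL RESPONSE]\n'
    '2 + 2 = 4.'
)
print(extract_final_response(raw))  # -> 2 + 2 = 4.
```

Falling back to the full text when the marker is missing guards against truncated generations that stop before the final answer.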

Performance

  • Inference Speed: ~25 tokens/sec output on RTX 5060 Ti (16GB)
  • Quality: Good - coherent reasoning, correct answers on factual questions
  • Memory: Requires 14GB+ VRAM with tight settings

Example output:

Input: "What is 2+2?"
Output: 
The user asks "What is 2+2?" It's a simple arithmetic question. The answer is 4.
[BEGIN FINAL RESPONSE]
2 + 2 = 4.

Known Limitations

  1. Does NOT work with transformers library - Crashes with KeyError: 'weight_scale' due to incomplete NVFP4 support in compressed-tensors for VLM architectures
  2. vLLM only - This is by design; NVFP4 is optimized for vLLM inference
  3. Tight memory requirements - Must use max_model_len=2048 and gpu_memory_utilization=0.90 to fit in 16GB VRAM
  4. Small context window - Limited to 2048 tokens due to VRAM constraints on 16GB GPUs
  5. VLM-specific bug - The weight_scale error affects vision-language models but not pure text models
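
Because of the 2048-token limit, vLLM rejects any request whose prompt tokens plus requested max_tokens exceed the model length. A quick pre-check (hypothetical helper name, not vLLM API):

```python
MAX_MODEL_LEN = 2048  # the setting used throughout this card

def fits_context(prompt_tokens: int, max_new_tokens: int) -> bool:
    """True if the prompt plus the generation budget fits in max_model_len."""
    return prompt_tokens + max_new_tokens <= MAX_MODEL_LEN

print(fits_context(1800, 200))  # -> True  (2000 <= 2048)
print(fits_context(1900, 200))  # -> False (2100 > 2048)
```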

Testing Environment

  • GPU: NVIDIA GeForce RTX 5060 Ti (16GB VRAM, Blackwell)
  • CPU: AMD Ryzen 9 5950X
  • RAM: 32GB DDR4 2666MHz
  • OS: Ubuntu 24.04.3 LTS (Noble)
  • vLLM: 0.15.1
  • transformers: 4.57.3 (for tokenizer only)
  • llmcompressor: Latest (Feb 2026)
  • NVIDIA Driver: 580.126.09

Comparison: VLM NVFP4 Challenges

Vision-language models (LLaVA-style architectures such as Apriel) face additional challenges with NVFP4:

  • transformers library cannot load them (weight_scale bug)
  • Larger memory footprint (11GB model + vision encoder)
  • Requires very tight vLLM settings

Pure text models (like Nanbeige 4.1 3B) are easier to run with more generous settings.

License

MIT (inherited from base model)
