# Apriel 1.6 15B Thinker - NVFP4
This is an NVFP4 quantized version of ServiceNow-AI/Apriel-1.6-15b-Thinker, optimized for NVIDIA Blackwell GPUs (RTX 50xx series).
✅ FULLY FUNCTIONAL - This model loads and generates correctly in vLLM 0.15.1!
Note: HuggingFace may incorrectly display this as 9B params due to quantization - this is a 15B parameter model compressed to 4-bit precision.
## Model Details
- Base Model: Apriel 1.6 15B Thinker (Multimodal reasoning model)
- Parameters: 15 billion (not 9B - HF auto-detection is confused by quantization)
- Quantization: NVFP4 (4-bit floating point)
- Size: 11GB (down from 29GB, 62% reduction)
- Quantizer: SILVERTHRONE
- Method: llmcompressor with 128 calibration samples
## Quantization Strategy
- ✅ Quantized to NVFP4: All linear layers
- ❌ Preserved at full precision: Vision encoder, embeddings, lm_head, multimodal projector
Calibration:
- Dataset: open-platypus
- Samples: 128
- Sequence length: 512
This selective approach maintains multimodal quality while achieving significant size reduction.
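For reference, a quantization run like the one described above can be expressed as an llmcompressor one-shot recipe along these lines. This is a sketch, not the exact script used for this model; the `ignore` patterns for the vision encoder, embeddings, and projector are assumptions and must be matched against the actual module names in the checkpoint:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Sketch of an NVFP4 recipe: quantize all Linear layers, skip the modules
# kept at full precision above. The regex patterns below are assumptions;
# verify them against the model's named_modules() before running.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        "re:.*vision.*",              # vision encoder (assumed pattern)
        "re:.*multi_modal_projector.*",  # multimodal projector (assumed pattern)
        "re:.*embed.*",               # embeddings (assumed pattern)
    ],
)

oneshot(
    model="ServiceNow-AI/Apriel-1.6-15b-Thinker",
    recipe=recipe,
    dataset="open_platypus",
    num_calibration_samples=128,
    max_seq_length=512,
)
```

Running this requires a GPU with enough memory to hold the full-precision model during calibration.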
## Hardware Requirements
- GPU: NVIDIA RTX 50xx series (Blackwell) or newer
- VRAM: 14-16GB minimum
- Framework: vLLM 0.15.1+
Note: This model will NOT work with llama.cpp, the transformers library, or GGUF-based tools.
## Usage

### With vLLM
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Load tokenizer for the chat template
tokenizer = AutoTokenizer.from_pretrained("SILVERTHRONE/apriel-1.6-15b-thinker-nvfp4")

# Load model (CRITICAL: set max_model_len=2048 and gpu_memory_utilization=0.90)
model = LLM(
    model="SILVERTHRONE/apriel-1.6-15b-thinker-nvfp4",
    gpu_memory_utilization=0.90,
    max_model_len=2048,  # Required! Model is 11GB; memory settings must be tight
    enforce_eager=True,
    max_num_seqs=1,
)

# Format the message with the chat template
messages = [{"role": "user", "content": "What is 2+2?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate
sampling_params = SamplingParams(temperature=0.7, max_tokens=200)
outputs = model.generate(prompt, sampling_params)
response = outputs[0].outputs[0].text
print(response)
```
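If you prefer serving the model over vLLM's OpenAI-compatible API instead of the Python API, the same tight memory settings apply. A sketch of the equivalent invocation (flag values mirror the Python example above):

```shell
vllm serve SILVERTHRONE/apriel-1.6-15b-thinker-nvfp4 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.90 \
  --enforce-eager \
  --max-num-seqs 1
```

This starts a server on port 8000 by default; any OpenAI-compatible client can then send chat completions to it.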
## Chat Template

Apriel uses a custom reasoning format with a `[BEGIN FINAL RESPONSE]` marker:
```
<s><|begin_system|>
You are a thoughtful, systematic AI assistant from ServiceNow Language Models (SLAM) lab.
Analyze each question carefully, present your reasoning step-by-step, then provide the
final response after the marker [BEGIN FINAL RESPONSE].
<|begin_user|>
What is 2+2?
<|begin_assistant|>
Here are my reasoning steps:
The user asks "What is 2+2?" It's a simple arithmetic question. The answer is 4.
[BEGIN FINAL RESPONSE]
2 + 2 = 4.
```
The model emits its reasoning steps first, then marks the final answer with `[BEGIN FINAL RESPONSE]`. Parse responses to extract the text after this marker.
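A minimal helper for splitting a response at the marker might look like this (the function name `split_apriel_response` is a hypothetical choice, not part of any library):

```python
# Split an Apriel response into (reasoning, final_answer) at the marker.
MARKER = "[BEGIN FINAL RESPONSE]"

def split_apriel_response(text: str) -> tuple[str, str]:
    """Return (reasoning, final_answer); final_answer is "" if the marker is absent."""
    if MARKER not in text:
        return text.strip(), ""
    reasoning, _, final = text.partition(MARKER)
    return reasoning.strip(), final.strip()

reasoning, answer = split_apriel_response(
    'The user asks "What is 2+2?" The answer is 4.\n[BEGIN FINAL RESPONSE]\n2 + 2 = 4.'
)
print(answer)  # 2 + 2 = 4.
```

Falling back to the full text when the marker is missing keeps the helper safe for truncated generations (e.g. when `max_tokens` cuts off mid-reasoning).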
## Performance
- Inference Speed: ~25 tokens/sec output on RTX 5060 Ti (16GB)
- Quality: Good - coherent reasoning, correct answers on factual questions
- Memory: Requires 14GB+ VRAM with tight settings
Example output:

Input: "What is 2+2?"

Output:

```
The user asks "What is 2+2?" It's a simple arithmetic question. The answer is 4.
[BEGIN FINAL RESPONSE]
2 + 2 = 4.
```
## Known Limitations
- Does NOT work with the transformers library - crashes with `KeyError: 'weight_scale'` due to incomplete NVFP4 support in compressed-tensors for VLM architectures
- vLLM only - this is by design; NVFP4 is optimized for vLLM inference
- Tight memory requirements - must use `max_model_len=2048` and `gpu_memory_utilization=0.90` to fit in 16GB VRAM
- Small context window - limited to 2048 tokens due to VRAM constraints on 16GB GPUs
- VLM-specific bug - The weight_scale error affects vision-language models but not pure text models
## Testing Environment
- GPU: NVIDIA GeForce RTX 5060 Ti (16GB VRAM, Blackwell)
- CPU: AMD Ryzen 9 5950X
- RAM: 32GB DDR4 2666MHz
- OS: Ubuntu 24.04.3 LTS (Noble)
- vLLM: 0.15.1
- transformers: 4.57.3 (for tokenizer only)
- llmcompressor: Latest (Feb 2026)
- NVIDIA Driver: 580.126.09
## Comparison: VLM NVFP4 Challenges
Vision-language models (LLaVA-style architectures such as Apriel) face additional challenges with NVFP4:
- transformers library cannot load them (weight_scale bug)
- Larger memory footprint (11GB model + vision encoder)
- Requires very tight vLLM settings
Pure text models (like Nanbeige 4.1 3B) are easier to run with more generous settings.
## Credits
- Original Model: ServiceNow-AI
- Quantization: SILVERTHRONE (@SILVERTHRONE)
- Method: Neural Magic llmcompressor
## License
MIT (inherited from base model)