Granite 4.1 8B NVFP4A16 — vLLM / RTX 5060 Ti Test
This repository contains a locally quantized Granite 4.1 8B model using NVFP4A16 / compressed-tensors, tested with vLLM nightly on a consumer NVIDIA Blackwell GPU.
This model card section documents the local runtime test performed by Murilo Vieira on RTX 5060 Ti 16GB.
Quantization Summary
| Item | Value |
|---|---|
| Base model | Granite 4.1 8B |
| Quantization format | NVFP4A16 |
| Runtime quantization loader | compressed-tensors |
| Runtime | vLLM OpenAI server |
| vLLM version tested | 0.21.1rc1.dev46+gb50646e5e |
| Model architecture resolved by vLLM | GraniteForCausalLM |
| Runtime dtype | torch.bfloat16 |
| KV cache dtype | fp8_e4m3 |
| Attention backend | FlashInfer |
| Tensor parallel size | 1 |
| Pipeline parallel size | 1 |
| Data parallel size | 1 |
| Language model only | true |
Tested Hardware
| Component | Configuration |
|---|---|
| GPU | NVIDIA GeForce RTX 5060 Ti |
| GPU VRAM | 16 GB |
| CPU | Intel Xeon E5-2680 v4 |
| System RAM | 64 GB |
| OS / runtime | Docker + NVIDIA Container Runtime |
| Container image | vllm/vllm-openai:nightly-x86_64 |
vLLM Runtime Configuration
The following settings were used during the test:
model: /ibm-granite-v0/granite-4.1-8b-NVFP4
dtype: auto
attention_backend: flashinfer
gpu_memory_utilization: 0.90
max_model_len: 65365
max_num_batched_tokens: 16384
max_num_seqs: 8
enable_auto_tool_choice: true
tool_call_parser: granite4
reasoning_parser: granite
chat_template: /ibm-granite-v0/granite-4.1-8b-NVFP4/chat_template.jinja
chat_template_content_format: openai
enable_chunked_prefill: true
performance_mode: interactivity
enforce_eager: true
trust_remote_code: true
tensor_parallel_size: 1
port: 8001
served_model_name:
- granite-4.0
structured_outputs_config:
backend: xgrammar
disable_any_whitespace: true
enable_in_reasoning: false
kv_cache_dtype: fp8_e4m3
enable_prefix_caching: true
prefix_caching_hash_algo: xxhash
async_scheduling: true
generation_config: vllm
language_model_only: true
Startup and Memory Statistics
| Metric | Result |
|---|---|
| Checkpoint size reported by vLLM | 4.94 GiB |
| Available system RAM reported by vLLM | 40.65 GiB |
| Weight loading time | 1.81 s |
| Model loading GPU memory | 4.97 GiB |
| Model loading time | 2.59 s |
| Available KV cache memory | 7.18 GiB |
| GPU KV cache size | 94,112 tokens |
| Max configured context length | 65,365 tokens |
| Estimated max concurrency at full context | 1.44x |
| Engine init time | 9.16 s |
| Server port | 8001 |
Throughput Observed
The following values were observed from vLLM runtime logs during interactive testing:
| Scenario | Prompt throughput | Generation throughput | Running requests | GPU KV usage |
|---|---|---|---|---|
| First active request | 4.0 tok/s | 18.2 tok/s | 1 | 0.2% |
| Steady single request | 0.0 tok/s | 26.9 tok/s | 1 | 0.5% |
| Steady single request | 0.0 tok/s | 26.9 tok/s | 1 | 0.8% |
| Steady single request | 0.0 tok/s | 27.0 tok/s | 1 | 1.1% |
| Steady single request | 0.0 tok/s | 27.0 tok/s | 1 | 1.4% |
| Prompt-heavy request | 165.3 tok/s | 12.1 tok/s | 0 | 0.0% |
Important Notes
- The model successfully started under vLLM with
quantization=compressed-tensors. - The runtime used
kv_cache_dtype: fp8_e4m3, which reduces KV cache memory usage and allows a very long configured context window. - vLLM warns that FP8 KV cache may cause accuracy degradation without a proper scaling factor.
enforce_eager: truedisablestorch.compileand CUDA Graphs. This reduced startup/compile overhead and made the server start quickly in this test.- FlashInfer was used as the attention backend and for sampling.
- The server successfully exposed OpenAI-compatible endpoints, including
/v1/chat/completions,/v1/completions,/v1/responses,/v1/models,/health, and/metrics.
Known Runtime Warnings
The following warnings were observed and are not necessarily fatal:
VLLM_BUILD_COMMIT,VLLM_BUILD_PIPELINE,VLLM_BUILD_URL, andVLLM_IMAGE_TAGwere detected as unknown vLLM environment variables.- Reasoning token ID auto-initialization failed for the Granite reasoning parser.
torch.compileand CUDA Graphs were disabled due toenforce_eager: true.- A Triton JIT compilation warning occurred during inference for
_compute_slot_mapping_kernel.
Docker Compose Used
vllm-granite:
image: vllm/vllm-openai:nightly-x86_64
container_name: vllm-granite
hostname: vllm-granite
restart: unless-stopped
runtime: nvidia
network_mode: host
ipc: host
shm_size: '16gb'
volumes:
- /mnt/dados/storage/cache/huggingface:/root/.cache/huggingface
- /mnt/dados/storage/cache/vllm-cache:/root/.cache/vllm
- /mnt/dados/storage/models/ibm-granite:/ibm-granite-v0:ro
- /mnt/dados/storage/config:/configs:ro
- /opt/models/ibm-granite:/ibm-granite-v1:ro
- /mnt/nas_models/ibm-granite:/ibm-granite-v2:ro
environment:
- VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
- TORCH_CUDA_ARCH_LIST=12.0
- CUDA_DEVICE_ORDER=PCI_BUS_ID
- NVIDIA_VISIBLE_DEVICES=all
- NVIDIA_DRIVER_CAPABILITIES=all
- VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
- OMP_NUM_THREADS=14
- VLLM_NO_USAGE_STATS=1
- DO_NOT_TRACK=1
- TORCH_MATMUL_PRECISION=high
- NVIDIA_FORWARD_COMPAT=1
- NVIDIA_DISABLE_REQUIRE=1
- VLLM_USE_FLASHINFER_SAMPLER=1
- VLLM_NVFP4_GEMM_BACKEND=marlin
- VLLM_TARGET_DEVICE=cuda
- VLLM_FLOAT32_MATMUL_PRECISION=high
- VLLM_USE_STANDALONE_COMPILE=1
- VLLM_ENABLE_V1_MULTIPROCESSING=1
- VLLM_V1_OUTPUT_PROC_CHUNK_SIZE=128
- VLLM_SKIP_MODEL_NAME_VALIDATION=1
- VLLM_LOGGING_LEVEL=INFO
- VLLM_LOG_STATS_INTERVAL=10
- VLLM_XGRAMMAR_CACHE_MB=1024
command: >
--config /configs/Granite.yaml
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ['1']
capabilities: [gpu]
Suggested vLLM Command
vllm serve /ibm-granite-v0/granite-4.1-8b-NVFP4 \
--trust-remote-code \
--served-model-name granite-4.0 \
--max-model-len 65365 \
--gpu-memory-utilization 0.90 \
--max-num-batched-tokens 16384 \
--max-num-seqs 8 \
--kv-cache-dtype fp8_e4m3 \
--attention-backend flashinfer \
--enable-chunked-prefill \
--enable-prefix-caching \
--prefix-caching-hash-algo xxhash \
--performance-mode interactivity \
--enforce-eager \
--port 8001
Reproducibility Notes
These results are from a single local test run and should be treated as environment-specific. Throughput may vary depending on:
- prompt length,
- output length,
- sampling settings,
- driver/CUDA versions,
- vLLM nightly build,
- whether
torch.compile/CUDA Graphs are enabled, - whether FlashInfer autotuning is enabled,
- KV cache dtype and context length.
Status
✅ Loaded successfully
✅ OpenAI-compatible server started
✅ Long context configured at 65,365 tokens
✅ FP8 KV cache enabled
✅ Tool parser enabled with granite4
✅ Tested on RTX 5060 Ti 16GB
- Downloads last month
- -
Model tree for murilonwt/granite-4.1-8b-NVFP4
Base model
ibm-granite/granite-4.1-8b