You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Granite 4.1 8B NVFP4A16 — vLLM / RTX 5060 Ti Test

This repository contains a locally quantized Granite 4.1 8B model using NVFP4A16 / compressed-tensors, tested with vLLM nightly on a consumer NVIDIA Blackwell GPU.

This model card section documents the local runtime test performed by Murilo Vieira on RTX 5060 Ti 16GB.

Quantization Summary

Item	Value
Base model	Granite 4.1 8B
Quantization format	NVFP4A16
Runtime quantization loader	`compressed-tensors`
Runtime	vLLM OpenAI server
vLLM version tested	`0.21.1rc1.dev46+gb50646e5e`
Model architecture resolved by vLLM	`GraniteForCausalLM`
Runtime dtype	`torch.bfloat16`
KV cache dtype	`fp8_e4m3`
Attention backend	FlashInfer
Tensor parallel size	1
Pipeline parallel size	1
Data parallel size	1
Language model only	true

Tested Hardware

Component	Configuration
GPU	NVIDIA GeForce RTX 5060 Ti
GPU VRAM	16 GB
CPU	Intel Xeon E5-2680 v4
System RAM	64 GB
OS / runtime	Docker + NVIDIA Container Runtime
Container image	`vllm/vllm-openai:nightly-x86_64`

vLLM Runtime Configuration

The following settings were used during the test:

model: /ibm-granite-v0/granite-4.1-8b-NVFP4
dtype: auto
attention_backend: flashinfer
gpu_memory_utilization: 0.90
max_model_len: 65365
max_num_batched_tokens: 16384
max_num_seqs: 8

enable_auto_tool_choice: true
tool_call_parser: granite4
reasoning_parser: granite

chat_template: /ibm-granite-v0/granite-4.1-8b-NVFP4/chat_template.jinja
chat_template_content_format: openai

enable_chunked_prefill: true
performance_mode: interactivity
enforce_eager: true
trust_remote_code: true
tensor_parallel_size: 1

port: 8001
served_model_name:
  - granite-4.0

structured_outputs_config:
  backend: xgrammar
  disable_any_whitespace: true
  enable_in_reasoning: false

kv_cache_dtype: fp8_e4m3
enable_prefix_caching: true
prefix_caching_hash_algo: xxhash

async_scheduling: true
generation_config: vllm
language_model_only: true

Startup and Memory Statistics

Metric	Result
Checkpoint size reported by vLLM	4.94 GiB
Available system RAM reported by vLLM	40.65 GiB
Weight loading time	1.81 s
Model loading GPU memory	4.97 GiB
Model loading time	2.59 s
Available KV cache memory	7.18 GiB
GPU KV cache size	94,112 tokens
Max configured context length	65,365 tokens
Estimated max concurrency at full context	1.44x
Engine init time	9.16 s
Server port	8001

Throughput Observed

The following values were observed from vLLM runtime logs during interactive testing:

Scenario	Prompt throughput	Generation throughput	Running requests	GPU KV usage
First active request	4.0 tok/s	18.2 tok/s	1	0.2%
Steady single request	0.0 tok/s	26.9 tok/s	1	0.5%
Steady single request	0.0 tok/s	26.9 tok/s	1	0.8%
Steady single request	0.0 tok/s	27.0 tok/s	1	1.1%
Steady single request	0.0 tok/s	27.0 tok/s	1	1.4%
Prompt-heavy request	165.3 tok/s	12.1 tok/s	0	0.0%

Important Notes

The model successfully started under vLLM with quantization=compressed-tensors.
The runtime used kv_cache_dtype: fp8_e4m3, which reduces KV cache memory usage and allows a very long configured context window.
vLLM warns that FP8 KV cache may cause accuracy degradation without a proper scaling factor.
enforce_eager: true disables torch.compile and CUDA Graphs. This reduced startup/compile overhead and made the server start quickly in this test.
FlashInfer was used as the attention backend and for sampling.
The server successfully exposed OpenAI-compatible endpoints, including /v1/chat/completions, /v1/completions, /v1/responses, /v1/models, /health, and /metrics.

Known Runtime Warnings

The following warnings were observed and are not necessarily fatal:

VLLM_BUILD_COMMIT, VLLM_BUILD_PIPELINE, VLLM_BUILD_URL, and VLLM_IMAGE_TAG were detected as unknown vLLM environment variables.
Reasoning token ID auto-initialization failed for the Granite reasoning parser.
torch.compile and CUDA Graphs were disabled due to enforce_eager: true.
A Triton JIT compilation warning occurred during inference for _compute_slot_mapping_kernel.

Docker Compose Used

vllm-granite:
  image: vllm/vllm-openai:nightly-x86_64
  container_name: vllm-granite
  hostname: vllm-granite
  restart: unless-stopped
  runtime: nvidia
  network_mode: host
  ipc: host
  shm_size: '16gb'
  volumes:
    - /mnt/dados/storage/cache/huggingface:/root/.cache/huggingface
    - /mnt/dados/storage/cache/vllm-cache:/root/.cache/vllm
    - /mnt/dados/storage/models/ibm-granite:/ibm-granite-v0:ro
    - /mnt/dados/storage/config:/configs:ro
    - /opt/models/ibm-granite:/ibm-granite-v1:ro
    - /mnt/nas_models/ibm-granite:/ibm-granite-v2:ro
  environment:
    - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
    - TORCH_CUDA_ARCH_LIST=12.0
    - CUDA_DEVICE_ORDER=PCI_BUS_ID
    - NVIDIA_VISIBLE_DEVICES=all
    - NVIDIA_DRIVER_CAPABILITIES=all
    - VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
    - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    - OMP_NUM_THREADS=14
    - VLLM_NO_USAGE_STATS=1
    - DO_NOT_TRACK=1
    - TORCH_MATMUL_PRECISION=high
    - NVIDIA_FORWARD_COMPAT=1
    - NVIDIA_DISABLE_REQUIRE=1
    - VLLM_USE_FLASHINFER_SAMPLER=1
    - VLLM_NVFP4_GEMM_BACKEND=marlin
    - VLLM_TARGET_DEVICE=cuda
    - VLLM_FLOAT32_MATMUL_PRECISION=high
    - VLLM_USE_STANDALONE_COMPILE=1
    - VLLM_ENABLE_V1_MULTIPROCESSING=1
    - VLLM_V1_OUTPUT_PROC_CHUNK_SIZE=128
    - VLLM_SKIP_MODEL_NAME_VALIDATION=1
    - VLLM_LOGGING_LEVEL=INFO
    - VLLM_LOG_STATS_INTERVAL=10
    - VLLM_XGRAMMAR_CACHE_MB=1024
  command: >
    --config /configs/Granite.yaml
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['1']
            capabilities: [gpu]

Suggested vLLM Command

vllm serve /ibm-granite-v0/granite-4.1-8b-NVFP4 \
  --trust-remote-code \
  --served-model-name granite-4.0 \
  --max-model-len 65365 \
  --gpu-memory-utilization 0.90 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 8 \
  --kv-cache-dtype fp8_e4m3 \
  --attention-backend flashinfer \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --prefix-caching-hash-algo xxhash \
  --performance-mode interactivity \
  --enforce-eager \
  --port 8001

Reproducibility Notes

These results are from a single local test run and should be treated as environment-specific. Throughput may vary depending on:

prompt length,
output length,
sampling settings,
driver/CUDA versions,
vLLM nightly build,
whether torch.compile/CUDA Graphs are enabled,
whether FlashInfer autotuning is enabled,
KV cache dtype and context length.

Status

✅ Loaded successfully
✅ OpenAI-compatible server started
✅ Long context configured at 65,365 tokens
✅ FP8 KV cache enabled
✅ Tool parser enabled with granite4
✅ Tested on RTX 5060 Ti 16GB

Downloads last month: 2

Safetensors

Model size

5B params

Tensor type

F32

BF16

F8_E4M3

Model tree for murilonwt/granite-4.1-8b-NVFP4

Base model

ibm-granite/granite-4.1-8b

Quantized

(51)

this model