You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

Granite 4.1 8B NVFP4A16 — vLLM / RTX 5060 Ti Test

This repository contains a locally quantized Granite 4.1 8B model using NVFP4A16 / compressed-tensors, tested with vLLM nightly on a consumer NVIDIA Blackwell GPU.

This model card section documents the local runtime test performed by Murilo Vieira on RTX 5060 Ti 16GB.

Quantization Summary

Item Value
Base model Granite 4.1 8B
Quantization format NVFP4A16
Runtime quantization loader compressed-tensors
Runtime vLLM OpenAI server
vLLM version tested 0.21.1rc1.dev46+gb50646e5e
Model architecture resolved by vLLM GraniteForCausalLM
Runtime dtype torch.bfloat16
KV cache dtype fp8_e4m3
Attention backend FlashInfer
Tensor parallel size 1
Pipeline parallel size 1
Data parallel size 1
Language model only true

Tested Hardware

Component Configuration
GPU NVIDIA GeForce RTX 5060 Ti
GPU VRAM 16 GB
CPU Intel Xeon E5-2680 v4
System RAM 64 GB
OS / runtime Docker + NVIDIA Container Runtime
Container image vllm/vllm-openai:nightly-x86_64

vLLM Runtime Configuration

The following settings were used during the test:

model: /ibm-granite-v0/granite-4.1-8b-NVFP4
dtype: auto
attention_backend: flashinfer
gpu_memory_utilization: 0.90
max_model_len: 65365
max_num_batched_tokens: 16384
max_num_seqs: 8

enable_auto_tool_choice: true
tool_call_parser: granite4
reasoning_parser: granite

chat_template: /ibm-granite-v0/granite-4.1-8b-NVFP4/chat_template.jinja
chat_template_content_format: openai

enable_chunked_prefill: true
performance_mode: interactivity
enforce_eager: true
trust_remote_code: true
tensor_parallel_size: 1

port: 8001
served_model_name:
  - granite-4.0

structured_outputs_config:
  backend: xgrammar
  disable_any_whitespace: true
  enable_in_reasoning: false

kv_cache_dtype: fp8_e4m3
enable_prefix_caching: true
prefix_caching_hash_algo: xxhash

async_scheduling: true
generation_config: vllm
language_model_only: true

Startup and Memory Statistics

Metric Result
Checkpoint size reported by vLLM 4.94 GiB
Available system RAM reported by vLLM 40.65 GiB
Weight loading time 1.81 s
Model loading GPU memory 4.97 GiB
Model loading time 2.59 s
Available KV cache memory 7.18 GiB
GPU KV cache size 94,112 tokens
Max configured context length 65,365 tokens
Estimated max concurrency at full context 1.44x
Engine init time 9.16 s
Server port 8001

Throughput Observed

The following values were observed from vLLM runtime logs during interactive testing:

Scenario Prompt throughput Generation throughput Running requests GPU KV usage
First active request 4.0 tok/s 18.2 tok/s 1 0.2%
Steady single request 0.0 tok/s 26.9 tok/s 1 0.5%
Steady single request 0.0 tok/s 26.9 tok/s 1 0.8%
Steady single request 0.0 tok/s 27.0 tok/s 1 1.1%
Steady single request 0.0 tok/s 27.0 tok/s 1 1.4%
Prompt-heavy request 165.3 tok/s 12.1 tok/s 0 0.0%

Important Notes

  • The model successfully started under vLLM with quantization=compressed-tensors.
  • The runtime used kv_cache_dtype: fp8_e4m3, which reduces KV cache memory usage and allows a very long configured context window.
  • vLLM warns that FP8 KV cache may cause accuracy degradation without a proper scaling factor.
  • enforce_eager: true disables torch.compile and CUDA Graphs. This reduced startup/compile overhead and made the server start quickly in this test.
  • FlashInfer was used as the attention backend and for sampling.
  • The server successfully exposed OpenAI-compatible endpoints, including /v1/chat/completions, /v1/completions, /v1/responses, /v1/models, /health, and /metrics.

Known Runtime Warnings

The following warnings were observed and are not necessarily fatal:

  • VLLM_BUILD_COMMIT, VLLM_BUILD_PIPELINE, VLLM_BUILD_URL, and VLLM_IMAGE_TAG were detected as unknown vLLM environment variables.
  • Reasoning token ID auto-initialization failed for the Granite reasoning parser.
  • torch.compile and CUDA Graphs were disabled due to enforce_eager: true.
  • A Triton JIT compilation warning occurred during inference for _compute_slot_mapping_kernel.

Docker Compose Used

vllm-granite:
  image: vllm/vllm-openai:nightly-x86_64
  container_name: vllm-granite
  hostname: vllm-granite
  restart: unless-stopped
  runtime: nvidia
  network_mode: host
  ipc: host
  shm_size: '16gb'
  volumes:
    - /mnt/dados/storage/cache/huggingface:/root/.cache/huggingface
    - /mnt/dados/storage/cache/vllm-cache:/root/.cache/vllm
    - /mnt/dados/storage/models/ibm-granite:/ibm-granite-v0:ro
    - /mnt/dados/storage/config:/configs:ro
    - /opt/models/ibm-granite:/ibm-granite-v1:ro
    - /mnt/nas_models/ibm-granite:/ibm-granite-v2:ro
  environment:
    - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
    - TORCH_CUDA_ARCH_LIST=12.0
    - CUDA_DEVICE_ORDER=PCI_BUS_ID
    - NVIDIA_VISIBLE_DEVICES=all
    - NVIDIA_DRIVER_CAPABILITIES=all
    - VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
    - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    - OMP_NUM_THREADS=14
    - VLLM_NO_USAGE_STATS=1
    - DO_NOT_TRACK=1
    - TORCH_MATMUL_PRECISION=high
    - NVIDIA_FORWARD_COMPAT=1
    - NVIDIA_DISABLE_REQUIRE=1
    - VLLM_USE_FLASHINFER_SAMPLER=1
    - VLLM_NVFP4_GEMM_BACKEND=marlin
    - VLLM_TARGET_DEVICE=cuda
    - VLLM_FLOAT32_MATMUL_PRECISION=high
    - VLLM_USE_STANDALONE_COMPILE=1
    - VLLM_ENABLE_V1_MULTIPROCESSING=1
    - VLLM_V1_OUTPUT_PROC_CHUNK_SIZE=128
    - VLLM_SKIP_MODEL_NAME_VALIDATION=1
    - VLLM_LOGGING_LEVEL=INFO
    - VLLM_LOG_STATS_INTERVAL=10
    - VLLM_XGRAMMAR_CACHE_MB=1024
  command: >
    --config /configs/Granite.yaml
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['1']
            capabilities: [gpu]

Suggested vLLM Command

vllm serve /ibm-granite-v0/granite-4.1-8b-NVFP4 \
  --trust-remote-code \
  --served-model-name granite-4.0 \
  --max-model-len 65365 \
  --gpu-memory-utilization 0.90 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 8 \
  --kv-cache-dtype fp8_e4m3 \
  --attention-backend flashinfer \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --prefix-caching-hash-algo xxhash \
  --performance-mode interactivity \
  --enforce-eager \
  --port 8001

Reproducibility Notes

These results are from a single local test run and should be treated as environment-specific. Throughput may vary depending on:

  • prompt length,
  • output length,
  • sampling settings,
  • driver/CUDA versions,
  • vLLM nightly build,
  • whether torch.compile/CUDA Graphs are enabled,
  • whether FlashInfer autotuning is enabled,
  • KV cache dtype and context length.

Status

✅ Loaded successfully
✅ OpenAI-compatible server started
✅ Long context configured at 65,365 tokens
✅ FP8 KV cache enabled
✅ Tool parser enabled with granite4
✅ Tested on RTX 5060 Ti 16GB

Downloads last month
-
Safetensors
Model size
5B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for murilonwt/granite-4.1-8b-NVFP4

Quantized
(40)
this model