Qwen3-14B NVFP4A16 — vLLM / RTX 5060 Ti

This repository contains a locally quantized Qwen3-14B model using NVFP4A16 / compressed-tensors for vLLM inference on NVIDIA Blackwell GPUs.

All Linear layers (except lm_head) were quantized to NVFP4A16 using llmcompressor + compressed-tensors.

This model warns the 14B model likely requires more VRAM than a single RTX 5060 Ti 16GB can provide at full context. Multi-GPU or reduced context length is recommended.

This model card documents the local quantization and test performed on RTX 5060 Ti 16GB.

Quantization Summary

Item Value
Base model Qwen/Qwen3-14B
Architecture Qwen3ForCausalLM
Hidden size 5120
Layers 40
Attention heads 40 (KV: 8)
Context length 40,960
Vocab size 151,936
Quantization format NVFP4A16
Compressed size ~9.9 GB
Quantization config compressed-tensors

Tested Hardware

Component Configuration
GPU NVIDIA GeForce RTX 5060 Ti 16 GB
CPU Intel Xeon E5-2680 v4
System RAM 64 GB
Runtime Docker + NVIDIA Container Runtime
Container image vllm/vllm-openai:v0.22.0-ubuntu2404

Suggested vLLM Command

vllm serve /models/Qwen3-14B-NVFP4 \
  --trust-remote-code \
  --served-model-name Qwen3-14B \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.93 \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 2 \
  --tensor-parallel-size 1 \
  --enforce-eager \
  --port 8000

Status

  • Quantized to NVFP4A16 (compressed-tensors)
  • Tested on RTX 5060 Ti 16 GB
  • Tested with vLLM v0.22.0 Docker image
  • Loads successfully in vLLM
Downloads last month
10
Safetensors
Model size
9B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for murilonwt/Qwen3-14B-NVFP4

Finetuned
Qwen/Qwen3-14B
Quantized
(179)
this model