--- license: apache-2.0 base_model: Qwen/Qwen2.5-3B-Instruct library_name: vllm pipeline_tag: text-generation tags: - qwen - qwen2.5 - nvfp4 - nvfp4a16 - compressed-tensors - vllm - blackwell - rtx-50-series - flashattention --- # Qwen2.5-3B-Instruct NVFP4A16 — vLLM / RTX 5060 Ti This repository contains a locally quantized **Qwen2.5-3B-Instruct** model using **NVFP4A16 / compressed-tensors** for vLLM inference on NVIDIA Blackwell GPUs. All Linear layers (except `lm_head`) were quantized to NVFP4A16 using `llmcompressor` + `compressed-tensors`. > This model card documents the local quantization and test performed on RTX 5060 Ti 16GB. ## Quantization Summary | Item | Value | |---|---| | Base model | Qwen/Qwen2.5-3B-Instruct | | Architecture | `Qwen2ForCausalLM` | | Hidden size | 2048 | | Layers | 36 | | Attention heads | 16 (KV: 2) | | Context length | 32,768 | | Vocab size | 151,936 | | Quantization format | NVFP4A16 | | Compressed size | ~2.7 GB | | Quantization config | `compressed-tensors` | ## Tested Hardware | Component | Configuration | |---|---| | GPU | NVIDIA GeForce RTX 5060 Ti 16 GB | | CPU | Intel Xeon E5-2680 v4 | | System RAM | 64 GB | | Runtime | Docker + NVIDIA Container Runtime | | Container image | `vllm/vllm-openai:v0.22.0-ubuntu2404` | ## Suggested vLLM Command ```bash vllm serve /models/Qwen2.5-3B-Instruct-NVFP4 \ --trust-remote-code \ --served-model-name Qwen2.5-3B \ --max-model-len 32768 \ --gpu-memory-utilization 0.93 \ --max-num-batched-tokens 8192 \ --max-num-seqs 4 \ --tensor-parallel-size 1 \ --enforce-eager \ --port 8000 ``` ## Status - [x] Quantized to NVFP4A16 (compressed-tensors) - [x] Tested on RTX 5060 Ti 16 GB - [x] Tested with vLLM v0.22.0 Docker image - [x] Loads successfully in vLLM