---
license: apache-2.0
base_model: Qwen/Qwen2.5-3B-Instruct
library_name: vllm
pipeline_tag: text-generation
tags:
  - qwen
  - qwen2.5
  - nvfp4
  - nvfp4a16
  - compressed-tensors
  - vllm
  - blackwell
  - rtx-50-series
  - flashattention
---

# Qwen2.5-3B-Instruct NVFP4A16 — vLLM / RTX 5060 Ti

This repository contains a locally quantized **Qwen2.5-3B-Instruct** model using **NVFP4A16 / compressed-tensors** for vLLM inference on NVIDIA Blackwell GPUs.

All Linear layers (except `lm_head`) were quantized to NVFP4A16 using `llmcompressor` + `compressed-tensors`.

> This model card documents the local quantization and test performed on RTX 5060 Ti 16GB.

## Quantization Summary

| Item | Value |
|---|---|
| Base model | Qwen/Qwen2.5-3B-Instruct |
| Architecture | `Qwen2ForCausalLM` |
| Hidden size | 2048 |
| Layers | 36 |
| Attention heads | 16 (KV: 2) |
| Context length | 32,768 |
| Vocab size | 151,936 |
| Quantization format | NVFP4A16 |
| Compressed size | ~2.7 GB |
| Quantization config | `compressed-tensors` |

## Tested Hardware

| Component | Configuration |
|---|---|
| GPU | NVIDIA GeForce RTX 5060 Ti 16 GB |
| CPU | Intel Xeon E5-2680 v4 |
| System RAM | 64 GB |
| Runtime | Docker + NVIDIA Container Runtime |
| Container image | `vllm/vllm-openai:v0.22.0-ubuntu2404` |

## Suggested vLLM Command

```bash
vllm serve /models/Qwen2.5-3B-Instruct-NVFP4 \
  --trust-remote-code \
  --served-model-name Qwen2.5-3B \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.93 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 4 \
  --tensor-parallel-size 1 \
  --enforce-eager \
  --port 8000
```

## Status

- [x] Quantized to NVFP4A16 (compressed-tensors)
- [x] Tested on RTX 5060 Ti 16 GB
- [x] Tested with vLLM v0.22.0 Docker image
- [x] Loads successfully in vLLM