| --- |
| license: apache-2.0 |
| base_model: Qwen/Qwen2.5-7B-Instruct |
| library_name: vllm |
| pipeline_tag: text-generation |
| tags: |
| - qwen |
| - qwen2.5 |
| - nvfp4 |
| - nvfp4a16 |
| - compressed-tensors |
| - vllm |
| - blackwell |
| - rtx-50-series |
| - flashattention |
| --- |
| |
| # Qwen2.5-7B-Instruct NVFP4A16 — vLLM / RTX 5060 Ti |
|
|
| This repository contains a locally quantized **Qwen2.5-7B-Instruct** model using **NVFP4A16 / compressed-tensors** for vLLM inference on NVIDIA Blackwell GPUs. |
|
|
| All Linear layers (except `lm_head`) were quantized to NVFP4A16 using `llmcompressor` + `compressed-tensors`. |
|
|
| > This model card documents the local quantization and test performed on RTX 5060 Ti 16GB. |
|
|
| ## Quantization Summary |
|
|
| | Item | Value | |
| |---|---| |
| | Base model | Qwen/Qwen2.5-7B-Instruct | |
| | Architecture | `Qwen2ForCausalLM` | |
| | Hidden size | 3584 | |
| | Layers | 28 | |
| | Attention heads | 28 (KV: 4) | |
| | Context length | 32,768 | |
| | Vocab size | 152,064 | |
| | Quantization format | NVFP4A16 | |
| | Compressed size | ~5.5 GB | |
| | Quantization config | `compressed-tensors` | |
|
|
| ## Tested Hardware |
|
|
| | Component | Configuration | |
| |---|---| |
| | GPU | NVIDIA GeForce RTX 5060 Ti 16 GB | |
| | CPU | Intel Xeon E5-2680 v4 | |
| | System RAM | 64 GB | |
| | Runtime | Docker + NVIDIA Container Runtime | |
| | Container image | `vllm/vllm-openai:v0.22.0-ubuntu2404` | |
|
|
| ## Suggested vLLM Command |
|
|
| ```bash |
| vllm serve /models/Qwen2.5-7B-Instruct-NVFP4 \ |
| --trust-remote-code \ |
| --served-model-name Qwen2.5-7B \ |
| --max-model-len 32768 \ |
| --gpu-memory-utilization 0.93 \ |
| --max-num-batched-tokens 8192 \ |
| --max-num-seqs 4 \ |
| --tensor-parallel-size 1 \ |
| --enforce-eager \ |
| --port 8000 |
| ``` |
|
|
| ## Status |
|
|
| - [x] Quantized to NVFP4A16 (compressed-tensors) |
| - [x] Tested on RTX 5060 Ti 16 GB |
| - [x] Tested with vLLM v0.22.0 Docker image |
| - [x] Loads successfully in vLLM |
|
|