README.md · murilonwt/Qwen2.5-3B-Instruct-NVFP4 at main

Qwen2.5-3B-Instruct-NVFP4 / README.md

murilonwt

Upload folder using huggingface_hub

ca7eb84 verified 5 days ago

preview code

raw

history blame contribute delete

1.8 kB

	---
	license: apache-2.0
	base_model: Qwen/Qwen2.5-3B-Instruct
	library_name: vllm
	pipeline_tag: text-generation
	tags:
	- qwen
	- qwen2.5
	- nvfp4
	- nvfp4a16
	- compressed-tensors
	- vllm
	- blackwell
	- rtx-50-series
	- flashattention
	---

	# Qwen2.5-3B-Instruct NVFP4A16 — vLLM / RTX 5060 Ti

	This repository contains a locally quantized Qwen2.5-3B-Instruct model using NVFP4A16 / compressed-tensors for vLLM inference on NVIDIA Blackwell GPUs.

	All Linear layers (except `lm_head`) were quantized to NVFP4A16 using `llmcompressor` + `compressed-tensors`.

	> This model card documents the local quantization and test performed on RTX 5060 Ti 16GB.

	## Quantization Summary

	\| Item \| Value \|
	\|---\|---\|
	\| Base model \| Qwen/Qwen2.5-3B-Instruct \|
	\| Architecture \| `Qwen2ForCausalLM` \|
	\| Hidden size \| 2048 \|
	\| Layers \| 36 \|
	\| Attention heads \| 16 (KV: 2) \|
	\| Context length \| 32,768 \|
	\| Vocab size \| 151,936 \|
	\| Quantization format \| NVFP4A16 \|
	\| Compressed size \| ~2.7 GB \|
	\| Quantization config \| `compressed-tensors` \|

	## Tested Hardware

	\| Component \| Configuration \|
	\|---\|---\|
	\| GPU \| NVIDIA GeForce RTX 5060 Ti 16 GB \|
	\| CPU \| Intel Xeon E5-2680 v4 \|
	\| System RAM \| 64 GB \|
	\| Runtime \| Docker + NVIDIA Container Runtime \|
	\| Container image \| `vllm/vllm-openai:v0.22.0-ubuntu2404` \|

	## Suggested vLLM Command

	```bash
	vllm serve /models/Qwen2.5-3B-Instruct-NVFP4 \
	--trust-remote-code \
	--served-model-name Qwen2.5-3B \
	--max-model-len 32768 \
	--gpu-memory-utilization 0.93 \
	--max-num-batched-tokens 8192 \
	--max-num-seqs 4 \
	--tensor-parallel-size 1 \
	--enforce-eager \
	--port 8000
	```

	## Status

	- [x] Quantized to NVFP4A16 (compressed-tensors)
	- [x] Tested on RTX 5060 Ti 16 GB
	- [x] Tested with vLLM v0.22.0 Docker image
	- [x] Loads successfully in vLLM