Qwen2.5-7B-Instruct-NVFP4 / README.md

murilonwt

Upload folder using huggingface_hub

76a4bb1 verified 5 days ago

preview code

raw

history blame contribute delete

1.8 kB

metadata

license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
library_name: vllm
pipeline_tag: text-generation
tags:
  - qwen
  - qwen2.5
  - nvfp4
  - nvfp4a16
  - compressed-tensors
  - vllm
  - blackwell
  - rtx-50-series
  - flashattention

Qwen2.5-7B-Instruct NVFP4A16 — vLLM / RTX 5060 Ti

This repository contains a locally quantized Qwen2.5-7B-Instruct model using NVFP4A16 / compressed-tensors for vLLM inference on NVIDIA Blackwell GPUs.

All Linear layers (except lm_head) were quantized to NVFP4A16 using llmcompressor + compressed-tensors.

This model card documents the local quantization and test performed on RTX 5060 Ti 16GB.

Quantization Summary

Item	Value
Base model	Qwen/Qwen2.5-7B-Instruct
Architecture	`Qwen2ForCausalLM`
Hidden size	3584
Layers	28
Attention heads	28 (KV: 4)
Context length	32,768
Vocab size	152,064
Quantization format	NVFP4A16
Compressed size	~5.5 GB
Quantization config	`compressed-tensors`

Tested Hardware

Component	Configuration
GPU	NVIDIA GeForce RTX 5060 Ti 16 GB
CPU	Intel Xeon E5-2680 v4
System RAM	64 GB
Runtime	Docker + NVIDIA Container Runtime
Container image	`vllm/vllm-openai:v0.22.0-ubuntu2404`

Suggested vLLM Command

vllm serve /models/Qwen2.5-7B-Instruct-NVFP4 \
  --trust-remote-code \
  --served-model-name Qwen2.5-7B \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.93 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 4 \
  --tensor-parallel-size 1 \
  --enforce-eager \
  --port 8000

Status

Quantized to NVFP4A16 (compressed-tensors)
Tested on RTX 5060 Ti 16 GB
Tested with vLLM v0.22.0 Docker image
Loads successfully in vLLM