# Velvet-14B (NVFP4 Quantization)
This repository contains a quantized version of the Almawave/Velvet-14B model. The model has been quantized to NVFP4 (NVIDIA FP4) format using the NVIDIA TensorRT-Model-Optimizer.
This format targets next-generation NVIDIA hardware with native FP4 support (the Blackwell architecture), and the checkpoint can be served efficiently using vLLM.
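For reference, the quantization flow with TensorRT-Model-Optimizer looks roughly like the sketch below. This is a minimal illustration, not the exact recipe used for this checkpoint: the calibration dataset, sample count, and export path are assumptions.

```python
# Sketch of NVFP4 quantization with TensorRT-Model-Optimizer (modelopt).
# Calibration data, sample count, and paths are illustrative assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model_id = "Almawave/Velvet-14B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Small calibration set; any representative Italian/English text works.
calib_texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:128]")["text"]

def forward_loop(m):
    # Run calibration batches through the model so modelopt can collect
    # the statistics needed to compute the NVFP4 scales.
    with torch.no_grad():
        for text in calib_texts:
            if not text.strip():
                continue
            inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
            m(**{k: v.to(m.device) for k, v in inputs.items()})

# NVFP4_DEFAULT_CFG applies 4-bit weights/activations with blockwise scaling.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Export a Hugging Face-compatible checkpoint that vLLM can load.
export_hf_checkpoint(model, export_dir="Velvet-14B-nvfp4")
```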
## Model Details
- Base Model: Almawave/Velvet-14B
- Quantization Format: NVFP4 (blockwise scaling; see the toy sketch after this list)
- Tool Used: NVIDIA TensorRT-Model-Optimizer (v0.35.0)
- Optimization Target: Low VRAM usage while maintaining high accuracy for Italian/English tasks.
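For intuition on the blockwise scheme: NVFP4 stores weights as 4-bit E2M1 values in blocks of 16 elements, with each block carrying its own FP8 (E4M3) scale on top of a per-tensor scale. The toy fake-quantization below illustrates the per-block step only; it is not the modelopt kernel, and the rounding details are simplified.

```python
# Toy illustration of NVFP4-style blockwise scaling (not the real kernel).
import torch

# Magnitudes representable in FP4 E2M1 (2 exponent bits, 1 mantissa bit).
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_block(block: torch.Tensor) -> torch.Tensor:
    """Fake-quantize one 16-element block to FP4 with a shared block scale."""
    scale = block.abs().max() / E2M1_GRID.max()  # per-block scale (stored as FP8 in NVFP4)
    if scale == 0:
        return block
    scaled = (block / scale).clamp(-6.0, 6.0)
    # Snap each value to the nearest representable E2M1 magnitude, keeping its sign.
    idx = (scaled.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(-1)
    return scaled.sign() * E2M1_GRID[idx] * scale

print(fake_quant_block(torch.randn(16)))
```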
## Usage with vLLM
You can serve this model directly using vLLM (a recent release with NVFP4/ModelOpt checkpoint support is recommended).
### Docker Command
```bash
docker run --rm -it \
  --runtime nvidia \
  --gpus all \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model "YOUR_USERNAME/Velvet-14B-nvfp4" \
  --trust-remote-code
```
(Note: Depending on your vLLM version and hardware, the quantization format may be auto-detected from the checkpoint config. If NVFP4-specific loading fails, falling back to fp8 or removing the quantization flag entirely often works.)
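Once the server is running it exposes an OpenAI-compatible API. A minimal client sketch (the prompt is illustrative; the model name must match the `--model` value above):

```python
# Query the vLLM OpenAI-compatible server started above.
from openai import OpenAI

# vLLM does not require a real API key unless one was configured.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="YOUR_USERNAME/Velvet-14B-nvfp4",
    messages=[{"role": "user", "content": "Riassumi in una frase cos'è la quantizzazione NVFP4."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```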
## Performance & Hardware
This quantization significantly reduces the memory footprint compared to the original BF16 model: roughly 4.5 bits per weight (4-bit values plus blockwise scales) versus 16 bits, about a 3.5x reduction in weight memory, making it suitable for high-throughput serving on enterprise GPUs.
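A back-of-the-envelope estimate of weight memory (parameter count rounded; KV cache and activations not included):

```python
# Rough weight-memory estimate for a 14B-parameter model (illustrative).
params = 14e9

bf16_gb = params * 16 / 8 / 1e9                # 16 bits per weight
# NVFP4: 4-bit weights + one 8-bit (FP8) scale per 16-element block
nvfp4_gb = params * (4 + 8 / 16) / 8 / 1e9     # ≈ 4.5 bits per weight

print(f"BF16  weights: ~{bf16_gb:.0f} GB")     # ~28 GB
print(f"NVFP4 weights: ~{nvfp4_gb:.0f} GB")    # ~8 GB
```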
## Credits
Original model by Almawave. Quantized by trithemius.