Velvet-14B (NVFP4 Quantization)

This repository contains a quantized version of the Almawave/Velvet-14B model, converted to the NVFP4 (NVIDIA FP4) format using NVIDIA TensorRT-Model-Optimizer.

This format targets NVIDIA Blackwell-generation hardware, which provides native FP4 acceleration, but the model can also be served efficiently with vLLM on earlier GPUs such as Hopper.

Model Details

  • Base Model: Almawave/Velvet-14B
  • Quantization Format: NVFP4 (Blockwise scaling)
  • Tool Used: NVIDIA TensorRT-Model-Optimizer v0.35.0 (a quantization sketch follows this list)
  • Optimization Target: Low VRAM usage while maintaining high accuracy for Italian/English tasks.
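
For reference, the sketch below shows a typical TensorRT-Model-Optimizer (modelopt) flow for producing an NVFP4 checkpoint. It is a minimal illustration, not the exact script used for this repository: the calibration texts are placeholders, and the export call may differ slightly between modelopt versions.

import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Almawave/Velvet-14B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical calibration set; in practice use a few hundred representative Italian/English samples.
calibration_texts = ["Esempio di testo di calibrazione.", "An example calibration sentence."]

def forward_loop(m):
    # Run the calibration data through the model so modelopt can collect quantization statistics.
    for text in calibration_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

# Quantize to NVFP4 using the default blockwise-scaling recipe.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Export a Hugging Face-compatible quantized checkpoint.
export_hf_checkpoint(model, export_dir="Velvet-14B-nvfp4")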

Usage with vLLM

You can serve this model directly using vLLM; use a recent release that includes NVFP4 (TensorRT-Model-Optimizer checkpoint) support.

Docker Command

docker run --rm -it \
    --runtime nvidia \
    --gpus all \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model "YOUR_USERNAME/Velvet-14B-nvfp4" \
    --trust-remote-code

(Note: depending on your vLLM version and hardware, the quantization method may be auto-detected from the checkpoint. If NVFP4-specific loading fails, falling back to fp8 or removing the quantization flag often works.)
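
Once the container is running, it exposes an OpenAI-compatible API on port 8000. The snippet below is a minimal usage example with the official openai Python client; the prompt and generation parameters are arbitrary illustrations.

from openai import OpenAI

# vLLM's OpenAI-compatible server accepts any API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="trithemius/Velvet-14B-nvfp4",
    messages=[{"role": "user", "content": "Summarize in one sentence what NVFP4 quantization is."}],
    max_tokens=128,
)
print(response.choices[0].message.content)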

Performance & Hardware

This quantization significantly reduces the memory footprint compared to the original BF16 model, making it suitable for high-throughput serving on enterprise GPUs.
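
As a rough back-of-the-envelope estimate (weights only, ignoring KV cache and activations), assuming about 14B parameters and an effective ~4.5 bits per weight for NVFP4 (4-bit values plus per-block scales), the weight memory drops from roughly 28 GB in BF16 to under 10 GB:

# Approximate weight-memory estimate; the parameter count and effective bit-width are assumptions.
params = 14e9
bf16_gb  = params * 16  / 8 / 1e9   # ~28 GB of BF16 weights
nvfp4_gb = params * 4.5 / 8 / 1e9   # ~7.9 GB of NVFP4 weights (including block scales)
print(f"BF16 weights:  ~{bf16_gb:.0f} GB")
print(f"NVFP4 weights: ~{nvfp4_gb:.1f} GB")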

Credits

Original model by Almawave. Quantized by trithemius.
