Velvet-14B (NVFP4 Quantization)

This repository contains a quantized version of the Almawave/Velvet-14B model, converted to the NVFP4 (NVIDIA FP4) format using NVIDIA TensorRT-Model-Optimizer.

This format targets NVIDIA Blackwell-generation hardware, which provides native FP4 acceleration, but the model can also be served efficiently with vLLM on earlier GPUs such as Hopper.

Model Details

  • Base Model: Almawave/Velvet-14B
  • Quantization Format: NVFP4 (Blockwise scaling)
  • Tool Used: NVIDIA TensorRT-Model-Optimizer v0.35.0 (a quantization sketch follows this list)
  • Optimization Target: Low VRAM usage while maintaining high accuracy for Italian/English tasks.
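
For reference, the sketch below shows a typical TensorRT-Model-Optimizer (modelopt) flow for producing an NVFP4 checkpoint. It is a minimal illustration, not the exact script used for this repository: the calibration texts are placeholders, and the export call may differ slightly between modelopt versions.

import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Almawave/Velvet-14B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical calibration set; in practice use a few hundred representative Italian/English samples.
calibration_texts = ["Esempio di testo di calibrazione.", "An example calibration sentence."]

def forward_loop(m):
    # Run the calibration data through the model so modelopt can collect quantization statistics.
    for text in calibration_texts:
        inputs = tokenizer(text, return_tensors="pt").to(m.device)
        m(**inputs)

# Quantize to NVFP4 using the default blockwise-scaling recipe.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Export a Hugging Face-compatible quantized checkpoint.
export_hf_checkpoint(model, export_dir="Velvet-14B-nvfp4")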

Usage with vLLM

You can serve this model directly using vLLM; use a recent release that includes NVFP4 (TensorRT-Model-Optimizer checkpoint) support.

Docker Command

docker run --rm -it \
    --runtime nvidia \
    --gpus all \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model "YOUR_USERNAME/Velvet-14B-nvfp4" \
    --trust-remote-code

(Note: depending on your vLLM version and hardware, the quantization method may be auto-detected from the checkpoint. If NVFP4-specific loading fails, falling back to fp8 or removing the quantization flag often works.)
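
Once the container is running, it exposes an OpenAI-compatible API on port 8000. The snippet below is a minimal usage example with the official openai Python client; the prompt and generation parameters are arbitrary illustrations.

from openai import OpenAI

# vLLM's OpenAI-compatible server accepts any API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="trithemius/Velvet-14B-nvfp4",
    messages=[{"role": "user", "content": "Summarize in one sentence what NVFP4 quantization is."}],
    max_tokens=128,
)
print(response.choices[0].message.content)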

Performance & Hardware

This quantization significantly reduces the memory footprint compared to the original BF16 model, making it suitable for high-throughput serving on enterprise GPUs.
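
As a rough back-of-the-envelope estimate (weights only, ignoring KV cache and activations), assuming about 14B parameters and an effective ~4.5 bits per weight for NVFP4 (4-bit values plus per-block scales), the weight memory drops from roughly 28 GB in BF16 to under 10 GB:

# Approximate weight-memory estimate; the parameter count and effective bit-width are assumptions.
params = 14e9
bf16_gb  = params * 16  / 8 / 1e9   # ~28 GB of BF16 weights
nvfp4_gb = params * 4.5 / 8 / 1e9   # ~7.9 GB of NVFP4 weights (including block scales)
print(f"BF16 weights:  ~{bf16_gb:.0f} GB")
print(f"NVFP4 weights: ~{nvfp4_gb:.1f} GB")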

Credits

Original model by Almawave. Quantized by trithemius.
