facebook_mms-1b-all-NVFP4

NVFP4 (W4A4) post-training quantization of facebook/mms-1b-all. Architecture: w2v2_ctc (Wav2Vec2 with a CTC head).

Format: nvfp4-pack-quantized (compressed-tensors). Weights are 4-bit FP4 (E2M1) with one FP8 (E4M3) scale per 16-element block and one FP32 global scale per tensor; activations are quantized to FP4 dynamically at runtime.
Calibration: 32 Persian clips from Reza2kn/persian-asr-eval-v0 (held out from the WER eval set).
Hardware target: NVIDIA Blackwell tensor cores (sm_100+). Quantized on RTX 5080 Laptop (sm_120).
Quantized layers: all Linear modules in the encoder/decoder (CTC lm_head / proj_out left in full precision).
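
To make the format line above concrete, here is a minimal pure-Python sketch of block-wise FP4 "fake quantization": each block of 16 values is scaled so its largest magnitude lands on 6 (the largest E2M1 value), rounded to the nearest E2M1 value, and scaled back. It is an illustration only, not the compressed-tensors implementation. It assumes round-to-nearest, keeps the block scale as a Python float instead of rounding it to E4M3, and leaves out the per-tensor FP32 global scale. The function names are hypothetical.

```python
# Magnitudes representable in FP4 E2M1 (the sign bit is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # NVFP4 uses one scale per 16 elements


def quantize_block(block):
    """Return (E2M1-valued codes, scale) for one block of floats."""
    amax = max(abs(v) for v in block)
    # Map the largest magnitude in the block onto 6.0, the top of the grid.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in block:
        # Round the scaled magnitude to the nearest representable value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(-mag if v < 0 else mag)
    return codes, scale


def dequantize_block(codes, scale):
    """Undo the scaling; the rounding error stays."""
    return [c * scale for c in codes]


def fake_quantize(values):
    """Quantize and immediately dequantize a flat list, block by block."""
    out = []
    for i in range(0, len(values), BLOCK):
        codes, scale = quantize_block(values[i:i + BLOCK])
        out.extend(dequantize_block(codes, scale))
    return out
```

With one 8-bit scale per 16 weights, storage works out to roughly 4 + 8/16 = 4.5 bits per quantized weight, not counting the per-tensor global scale.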

Eval — `Reza2kn/persian-asr-eval-v0` (FLEURS-fa)

Variant	WER ↓	CER ↓	clips	per-clip latency	peak VRAM
NVFP4 (this repo)	17.68%	4.69%	200	292 ms	3403 MiB
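
For readers who want to reproduce numbers of this kind, the sketch below shows one common way to compute WER and CER: Levenshtein distance over words or characters, summed over all clips and divided by the total reference length. This card does not state how its scores were aggregated, so the corpus-level pooling and the choice to count spaces as characters in CER are assumptions. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Pooled error rate: total edits / total reference units.

    unit="word" gives WER, unit="char" gives CER (spaces included).
    """
    split = (lambda s: s.split()) if unit == "word" else list
    errors = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return errors / total
```

For example, `error_rate(["a b c d"], ["a x c"])` counts one substitution and one deletion against four reference words.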

Persian text normalization for WER/CER: NFKC, ZWNJ → space, ي→ی / ك→ک, digit folding, punctuation stripping, whitespace collapse.
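
The normalization steps listed above could be implemented roughly as follows. This is a sketch, not the evaluation script used for this card. It assumes that "digit folding" means mapping Persian and Arabic-Indic digits to ASCII, and that "punctuation" means any character in a Unicode `P*` category. The name `normalize_fa` is hypothetical.

```python
import re
import unicodedata

# Single-character replacements, keyed by code point.
_CHAR_MAP = {
    0x200C: " ",       # ZWNJ -> space
    0x064A: "\u06cc",  # Arabic yeh -> Persian yeh
    0x0643: "\u06a9",  # Arabic kaf -> Persian keheh
}
for d in range(10):
    _CHAR_MAP[0x06F0 + d] = str(d)  # Persian digits -> ASCII (assumed direction)
    _CHAR_MAP[0x0660 + d] = str(d)  # Arabic-Indic digits -> ASCII


def normalize_fa(text):
    """Apply the listed normalization steps in order."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_CHAR_MAP)
    # Drop every character whose Unicode category is a punctuation class.
    text = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

Apply the same function to both reference and hypothesis before scoring, so that spelling variants of the same letter do not count as errors.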

Usage

import torch
import soundfile as sf  # used to read audio in the decoding step
from transformers import AutoProcessor, AutoModelForCTC

repo = "Reza2kn/facebook_mms-1b-all-NVFP4"
processor = AutoProcessor.from_pretrained(repo)
# AutoModelForCTC keeps the CTC lm_head; plain AutoModel would load the encoder without it.
# Load in bfloat16: NVFP4 weights decompress to bf16 inside CompressedLinear.
model = AutoModelForCTC.from_pretrained(repo, dtype=torch.bfloat16).to("cuda").eval()

(See the original facebook/mms-1b-all model card for the architecture-specific decoding boilerplate.)
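
For orientation, the decoding step for a CTC model is usually greedy: take the argmax token id for each frame, merge consecutive repeats, and drop the blank token. The pure-Python sketch below shows that rule on made-up data. The toy vocabulary and the blank id of 0 are assumptions for illustration, and the function name is hypothetical. In transformers, passing the argmax ids to the processor's `batch_decode` performs the equivalent collapse using the model's real vocabulary.

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse per-frame argmax ids into text using the CTC rule."""
    out = []
    prev = None
    for i in frame_ids:
        # Emit a token only when it differs from the previous frame
        # and is not the blank symbol.
        if i != prev and i != blank_id:
            out.append(id_to_token[i])
        prev = i
    return "".join(out)


# Toy example: a blank between two runs of "a" keeps both letters.
toy_vocab = {0: "<pad>", 1: "a", 2: "b"}
decoded = ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0], toy_vocab)
```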

How it was made

Quantized with llmcompressor using QuantizationModifier(targets=["Linear"], scheme="NVFP4", ignore=...), then saved as a compressed-tensors nvfp4-pack-quantized checkpoint.

License

Inherits the base model's license.
