omniASR LLM 7B v2 — NF4 Quantized

Quantized checkpoint for omniASR LLM 7B v2 using bitsandbytes NF4 (4-bit).

You do not need to download the original 29 GB model. The checkpoint is a self-contained full-module pickle that loads directly with torch.load().

Available checkpoint

| File | Quant | Size | VRAM (idle) | VRAM (peak) | RTF (batch 8) |
|---|---|---|---|---|---|
| `omniASR_LLM_7B_v2_nf4_full.pt` | NF4 (4-bit) | 3.9 GB | ~4 GB | ~12 GB | 0.23 |

Usage

With omniASR-server (Docker — recommended)

git clone https://github.com/mufradhossain/omniASR-server.git
cd omniASR-server

# Download this checkpoint
mkdir -p checkpoints
# Place omniASR_LLM_7B_v2_nf4_full.pt in ./checkpoints/

# Start the server
docker compose up -d

The server loads in about 32 seconds and serves at http://localhost:8000.
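
Because the model takes a while to load, a client script may want to wait until the server answers before sending audio. The following is a minimal sketch using only the Python standard library. It assumes nothing about the server's routes: it polls the base URL and treats any HTTP status (even a 404) as a sign that the server process is responding. The helper name `wait_for_http` is hypothetical and not part of omniASR-server.

```python
import time
import urllib.error
import urllib.request

# Talk to the local server directly, ignoring any proxy environment variables.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_http(url: str, timeout: float = 120.0, interval: float = 1.0) -> bool:
    """Return True once `url` answers with any HTTP status, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with _opener.open(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered with an error status, so it is up.
            return True
        except OSError:
            # Connection refused, reset, or timed out: not ready yet, retry.
            time.sleep(interval)
    return False


# Example: wait_for_http("http://localhost:8000/")
```

An HTTP response only shows that the web server is running. Whether the model has finished loading at that point depends on how the server starts up, which this sketch does not check.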

Standalone

import torch
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline, load_tokenizer

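# Full-module pickle: PyTorch 2.6+ defaults to weights_only=True, which rejects it.
# Only unpickle checkpoints from a source you trust.
model = torch.load("omniASR_LLM_7B_v2_nf4_full.pt", map_location="cpu", weights_only=False)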

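# Restore quantized layers to GPU (workaround for pickled bitsandbytes modules)
device = torch.device("cuda")
from bitsandbytes.nn import LinearNF4, Linear8bitLt
for m in model.modules():
    if not isinstance(m, (LinearNF4, Linear8bitLt)):
        # Regular modules: move only their own parameters and buffers.
        for p in m.parameters(recurse=False):
            p.data = p.data.to(device)
        for b in m.buffers(recurse=False):
            b.data = b.data.to(device)
    else:
        # Quantized layers: move the packed weight, its quant state, and the bias.
        m.weight.data = m.weight.data.to(device)
        quant_state = getattr(m.weight, "quant_state", None)
        if quant_state is not None:
            # Depending on the bitsandbytes version, QuantState.to() may move its
            # tensors in place and return None, so only reassign a real result.
            moved = quant_state.to(device)
            if moved is not None:
                m.weight.quant_state = moved
        if m.bias is not None:
            m.bias.data = m.bias.data.to(device)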

tokenizer = load_tokenizer("omniASR_LLM_7B_v2")
pipeline = ASRInferencePipeline(model_card=None, model=model, tokenizer=tokenizer, device=device)

result = pipeline.transcribe(["audio.wav"], lang=["eng_Latn"], batch_size=8)
print(result[0])
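
To compare your own hardware against the RTF column in the table, you can time a transcription call and divide by the audio duration. The sketch below assumes the usual definition of real-time factor (wall-clock processing time divided by audio length) and that the inputs are PCM WAV files readable by Python's `wave` module. The helper names `wav_duration_seconds` and `real_time_factor` are hypothetical.

```python
import time
import wave
from typing import Callable, Sequence


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(transcribe: Callable[[Sequence[str]], object],
                     paths: Sequence[str]) -> float:
    """Wall-clock time of `transcribe(paths)` divided by total audio seconds."""
    audio_seconds = sum(wav_duration_seconds(p) for p in paths)
    start = time.perf_counter()
    transcribe(paths)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


# Hypothetical usage with the pipeline built above:
# rtf = real_time_factor(
#     lambda files: pipeline.transcribe(files, lang=["eng_Latn"], batch_size=8),
#     ["audio.wav"],
# )
```

A value below 1.0 means transcription runs faster than real time. The first call often includes one-time warm-up cost, so timing a second run is likely to give a more representative number.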

Quantization details

NF4: 4-bit NormalFloat4 via bitsandbytes.nn.LinearNF4
Streaming quantization: each layer loaded from mmap -> quantized -> moved to GPU -> CPU memory freed
Full-module pickle with quant state restoration on load
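
For readers unfamiliar with NF4, the idea can be sketched in plain Python: weights are split into blocks, each block is scaled by its absolute maximum, and every scaled value is replaced by the index of the nearest entry in a fixed 16-value codebook. This is an illustration only, not the bitsandbytes implementation, which packs two indices per byte and runs on the GPU. The codebook values are quoted from memory of the QLoRA reference code and the block size of 64 is an assumption, so check both against bitsandbytes before relying on them.

```python
# 16 NF4 code values (assumed; verify against the bitsandbytes source).
NF4_CODE = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

BLOCK_SIZE = 64  # illustrative block size (assumption)


def quantize_block(values):
    """Return (absmax, indices): one scale plus a 4-bit code index per value."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    indices = [
        min(range(16), key=lambda i: abs(NF4_CODE[i] - v / absmax))
        for v in values
    ]
    return absmax, indices


def quantize(weights, block_size=BLOCK_SIZE):
    """Quantize a flat list of floats block by block."""
    return [
        quantize_block(weights[start:start + block_size])
        for start in range(0, len(weights), block_size)
    ]


def dequantize(blocks):
    """Rebuild approximate floats from (absmax, indices) pairs."""
    restored = []
    for absmax, indices in blocks:
        restored.extend(NF4_CODE[i] * absmax for i in indices)
    return restored
```

Each weight costs 4 bits plus a share of one scale per block. The per-block scale plays the role of the quant state that has to be kept next to the packed weights and moved to the GPU with them.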