omniASR LLM 7B v2: NF4 Quantized

Quantized checkpoint for omniASR LLM 7B v2 using bitsandbytes NF4 (4-bit).

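You do not need to download the original 29 GB model. The checkpoint is a self-contained full-module pickle that loads directly with torch.load(). Because the file pickles the whole module rather than a state dict, pass weights_only=False on PyTorch 2.6 or later (where weights_only defaults to True), and keep omnilingual_asr and bitsandbytes installed, since unpickling imports their classes.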

Available checkpoint

File                           Quant        Size    VRAM (idle)  VRAM (peak)  RTF (batch 8)
omniASR_LLM_7B_v2_nf4_full.pt  NF4 (4-bit)  3.9 GB  ~4 GB        ~12 GB       0.23
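
To put the table in concrete terms, here is a small arithmetic sketch. It assumes RTF means the usual real-time factor (processing time divided by audio duration); the hardware behind the 0.23 figure is not stated here, so treat the results as rough estimates. The function names are illustrative only.

```python
def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time to transcribe audio at a given real-time factor."""
    return audio_seconds * rtf


def compression_ratio(original_gb: float, quantized_gb: float) -> float:
    """How many times smaller the quantized checkpoint is than the original."""
    return original_gb / quantized_gb


# One minute of audio at RTF 0.23 takes roughly 13.8 seconds to process.
print(round(processing_seconds(60, 0.23), 1))

# The 3.9 GB file is roughly 7.4x smaller than the original 29 GB model.
print(round(compression_ratio(29, 3.9), 1))
```

An RTF below 1.0 means transcription runs faster than real time; at 0.23, one hour of audio would take about 14 minutes under the same conditions.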

Usage

With omniASR-server (Docker, recommended)

git clone https://github.com/mufradhossain/omniASR-server.git
cd omniASR-server

# Download this checkpoint
mkdir -p checkpoints
# Place omniASR_LLM_7B_v2_nf4_full.pt in ./checkpoints/

# Start the server
docker compose up -d

Server loads in ~32 seconds. Serves at http://localhost:8000.
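
Since the model takes a while to load, a script that talks to the server may want to wait until it answers. The following is a minimal sketch using only the standard library. It assumes nothing about the server's routes: any HTTP response from the base URL, even an error status, is treated as "up". The helper name wait_for_http is hypothetical and not part of omniASR-server.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_http(url: str = "http://localhost:8000",
                  timeout: float = 120.0,
                  interval: float = 1.0) -> bool:
    """Poll `url` until the server sends any HTTP response or `timeout` elapses."""
    # Bypass any configured proxy, since the target is a local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with opener.open(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered, just not with a 2xx status.
            return True
        except (OSError, http.client.HTTPException):
            # Connection refused, reset, or timed out: not ready yet.
            time.sleep(interval)
    return False
```

Calling wait_for_http() after docker compose up -d returns True once the server responds and False if it stays silent for the whole timeout.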

Standalone

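import torch
from bitsandbytes.nn import LinearNF4, Linear8bitLt
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline, load_tokenizer

# The file is a full-module pickle, so weights_only=False is required on
# PyTorch >= 2.6 (where weights_only defaults to True). Load trusted files only.
model = torch.load("omniASR_LLM_7B_v2_nf4_full.pt", map_location="cpu", weights_only=False)

# Restore quantized layers to GPU (fix for bitsandbytes pickle)
device = torch.device("cuda")
for m in model.modules():
    if not isinstance(m, (LinearNF4, Linear8bitLt)):
        # Regular modules: move their own parameters and buffers
        for p in m.parameters(recurse=False):
            p.data = p.data.to(device)
        for b in m.buffers(recurse=False):
            b.data = b.data.to(device)
    else:
        # Quantized layers: move the packed weight, its quant state, and the bias
        m.weight.data = m.weight.data.to(device)
        quant_state = getattr(m.weight, "quant_state", None)
        if quant_state is not None:
            # QuantState.to() may move tensors in place and return None,
            # depending on the bitsandbytes version, so keep the old object then
            moved = quant_state.to(device)
            m.weight.quant_state = moved if moved is not None else quant_state
        if m.bias is not None:
            m.bias.data = m.bias.data.to(device)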

tokenizer = load_tokenizer("omniASR_LLM_7B_v2")
pipeline = ASRInferencePipeline(model_card=None, model=model, tokenizer=tokenizer, device=device)

result = pipeline.transcribe(["audio.wav"], lang=["eng_Latn"], batch_size=8)
print(result[0])
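
After the restore loop it can be useful to confirm that nothing was left on the CPU. The sketch below is an illustrative helper, not part of omnilingual_asr or bitsandbytes. It relies only on the named_parameters() and named_buffers() methods that every torch.nn.Module provides, so it does not see tensors stored inside a bitsandbytes quant state.

```python
from collections import Counter
from itertools import chain


def summarize_devices(model) -> Counter:
    """Count parameters and buffers per device for a torch.nn.Module-like object."""
    tensors = chain(model.named_parameters(), model.named_buffers())
    return Counter(str(t.device) for _, t in tensors)


def tensors_off_device(model, expected_prefix: str = "cuda") -> list:
    """Names of parameters and buffers whose device does not start with `expected_prefix`."""
    tensors = chain(model.named_parameters(), model.named_buffers())
    return [name for name, t in tensors if not str(t.device).startswith(expected_prefix)]
```

With the model loaded as above, print(summarize_devices(model)) shows how many tensors sit on each device, and an empty list from tensors_off_device(model) suggests the move to GPU was complete for parameters and buffers.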

Quantization details

  • NF4: 4-bit NormalFloat4 via bitsandbytes.nn.LinearNF4
  • Streaming quantization: each layer loaded from mmap -> quantized -> moved to GPU -> CPU memory freed
  • Full-module pickle with quant state restoration on load
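
To illustrate what 4-bit NormalFloat quantization does to a weight tensor, here is a pure-Python sketch of blockwise absmax quantization against a 16-level codebook. It is not the bitsandbytes implementation: the level values are rounded approximations of the published NF4 codebook, the block size of 64 is an assumption, and the real kernels pack two codes per byte and run on the GPU.

```python
# Approximate NF4 levels (rounded to 4 decimals; assumption for illustration).
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(values):
    """Scale a block by its absolute maximum and map each value to the nearest level."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    codes = [
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - v / absmax))
        for v in values
    ]
    return codes, absmax


def dequantize_block(codes, absmax):
    """Recover approximate values from 4-bit codes and the block's scale."""
    return [NF4_LEVELS[c] * absmax for c in codes]


def quantize(values, block_size=64):
    """Quantize a flat list of weights block by block (block_size is an assumption)."""
    return [
        quantize_block(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]


codes, scale = quantize_block([0.5, -2.0, 0.0, 1.0])
print(codes, scale)
print(dequantize_block(codes, scale))
```

Each weight is stored as a code from 0 to 15 plus one scale per block, which is why the quant state has to travel with the weights when the checkpoint is loaded.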

Links
