omniASR LLM 7B v2 - NF4 Quantized
Quantized checkpoint for omniASR LLM 7B v2 using bitsandbytes NF4 (4-bit).
No need to download the original 29 GB model. Self-contained full-module pickle: load directly with torch.load().
Available checkpoint
| File | Quant | Size | VRAM (idle) | VRAM (peak) | RTF (batch 8) |
|---|---|---|---|---|---|
| omniASR_LLM_7B_v2_nf4_full.pt | NF4 (4-bit) | 3.9 GB | ~4 GB | ~12 GB | 0.23 |
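To put the RTF column in practical terms, here is a small sketch that converts a real-time factor into wall-clock processing time. It assumes the conventional definition (RTF = processing time / audio duration), which the table does not spell out, and the helper names are illustrative only. Actual throughput depends on the GPU and on batch composition.

```python
def processing_seconds(audio_seconds, rtf=0.23):
    """Estimated processing time, assuming RTF = processing time / audio duration."""
    return audio_seconds * rtf


def speedup(rtf=0.23):
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf


# One hour of audio at the RTF reported in the table (batch size 8).
one_hour = 3600
print(f"{processing_seconds(one_hour) / 60:.1f} minutes to transcribe 1 hour of audio")
print(f"{speedup():.2f}x faster than real time")
```

Under that assumption, an RTF of 0.23 means an hour of audio takes roughly 14 minutes to transcribe.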
Usage
With omniASR-server (Docker, recommended)
git clone https://github.com/mufradhossain/omniASR-server.git
cd omniASR-server
# Download this checkpoint
mkdir -p checkpoints
# Place omniASR_LLM_7B_v2_nf4_full.pt in ./checkpoints/
# Start the server
docker compose up -d
Server loads in ~32 seconds. Serves at http://localhost:8000.
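Because the server needs some time to load, a script that talks to it may want to wait until the port is open first. The following is a minimal sketch using only the standard library; `wait_for_port` is a hypothetical helper, not part of omniASR-server, and an open TCP port only shows that the process is listening, not necessarily that the model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 8000, timeout=120)` would block for up to two minutes after `docker compose up -d` before giving up.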
Standalone
import torch
from bitsandbytes.nn import LinearNF4, Linear8bitLt
from omnilingual_asr.models.inference.pipeline import ASRInferencePipeline, load_tokenizer

# The file is a pickled nn.Module, not a state dict, so PyTorch 2.6+ needs
# weights_only=False. Unpickling runs arbitrary code: only load files you trust.
model = torch.load("omniASR_LLM_7B_v2_nf4_full.pt", map_location="cpu", weights_only=False)

# Restore quantized layers to GPU (fix for bitsandbytes pickle)
device = torch.device("cuda")
for m in model.modules():
    if not isinstance(m, (LinearNF4, Linear8bitLt)):
        # Regular modules: move only their own parameters and buffers
        for p in m.parameters(recurse=False):
            p.data = p.data.to(device)
        for b in m.buffers(recurse=False):
            b.data = b.data.to(device)
    else:
        # Quantized layers: move the packed weight, its quant state, and the bias
        m.weight.data = m.weight.data.to(device)
        qs = getattr(m.weight, "quant_state", None)
        if qs is not None:
            # QuantState.to() may update in place and return None,
            # so reassign only when it actually returns an object
            moved = qs.to(device)
            if moved is not None:
                m.weight.quant_state = moved
        if m.bias is not None:
            m.bias.data = m.bias.data.to(device)

tokenizer = load_tokenizer("omniASR_LLM_7B_v2")
pipeline = ASRInferencePipeline(model_card=None, model=model, tokenizer=tokenizer, device=device)
result = pipeline.transcribe(["audio.wav"], lang=["eng_Latn"], batch_size=8)
print(result[0])
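After the restoration loop, it can be useful to confirm that nothing was left behind on the CPU. The sketch below is a hypothetical helper (not part of omnilingual_asr or bitsandbytes) that counts parameters and buffers per device. It only relies on `parameters()`, `buffers()`, and a `.device` attribute, and it does not inspect tensors held inside a bitsandbytes quant state.

```python
from collections import Counter


def device_summary(model):
    """Count parameters and buffers per device, e.g. {'cuda:0': 412, 'cpu': 3}."""
    counts = Counter()
    for tensor in list(model.parameters()) + list(model.buffers()):
        counts[str(tensor.device)] += 1
    return dict(counts)
```

Calling `device_summary(model)` after the loop should ideally report a single CUDA device; any `cpu` entry points at a tensor the loop missed.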
Quantization details
- NF4: 4-bit NormalFloat4 via bitsandbytes.nn.LinearNF4
- Streaming quantization: each layer loaded from mmap -> quantized -> moved to GPU -> CPU memory freed
- Full-module pickle with quant state restoration on load
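To illustrate what NF4 does to a weight tensor, here is a simplified pure-Python sketch of blockwise NormalFloat4 quantization. It is not the bitsandbytes kernel: the real implementation packs two 4-bit codes per byte and runs on the GPU, and the block size of 64 used here is an assumption. The function names are illustrative. The 16 codebook values are the NF4 levels published with QLoRA.

```python
# The 16 NF4 levels: quantiles of a standard normal, rescaled to [-1, 1].
NF4_CODEBOOK = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def quantize_nf4(values, block_size=64):
    """Return (codes, scales): one 4-bit code per value, one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # Scale each block by its largest magnitude; fall back to 1.0 for an all-zero block
        absmax = max(abs(v) for v in block) or 1.0
        scales.append(absmax)
        for v in block:
            x = v / absmax
            # Pick the index of the nearest codebook level
            codes.append(min(range(16), key=lambda i: abs(NF4_CODEBOOK[i] - x)))
    return codes, scales


def dequantize_nf4(codes, scales, block_size=64):
    """Reconstruct approximate values from codes and per-block scales."""
    return [NF4_CODEBOOK[c] * scales[i // block_size] for i, c in enumerate(codes)]
```

Each weight is stored as a 4-bit index plus a shared per-block scale, which is where most of the size reduction relative to the original checkpoint comes from.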
Links
- Original model: ARahim3/omniASR-LLM-7B-v2
- Server code: github.com/mufradhossain/omniASR-server