tsp-stefano committed on
Commit
22a85ee
·
verified ·
1 Parent(s): 8198ed0

Single-file Marlin INT4: direct RTN quantization, no GPTQ intermediate


Replace 2-file GPTQ+Marlin format with single consolidated.safetensors.

- Single-step BF16 → RTN → Marlin pack (no intermediate GPTQ, scales computed once)
- 4.07 GB single file (was 5.8 GB across two files)
- Python server: remove DequantLinear/MarlinLinear, add PrepackedMarlinLinear
- Add quantize_marlin.py for reproducibility
- Tested on Jetson Orin Nano: identical transcription, 15.2 tok/s

README.md CHANGED
@@ -8,6 +8,7 @@ tags:
8
  - mistral
9
  - int4
10
  - quantized
 
11
  - jetson
12
  - edge
13
  - realtime
@@ -32,35 +33,49 @@ language:
32
 
33
  INT4 quantized [Voxtral Mini 4B Realtime](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) for edge deployment on NVIDIA Jetson Orin Nano (8 GB).
34
 
35
- **4.1 GB total** — fits in 8 GB unified memory with room for KV cache and runtime.
36
 
37
  ## What's in this repo
38
 
39
  | File | Size | Description |
40
  |------|------|-------------|
41
- | `consolidated.safetensors` | 4.1 GB | INT4 GPTQ-packed weights (encoder fp16 + decoder int4) |
42
  | `params.json` | 1.6 KB | Model architecture config (Mistral native format) |
43
  | `tekken.json` | 15 MB | Mistral tekken tokenizer |
44
- | `scripts/jetson_serve_sdpa.py` | 53 KB | Self-contained inference server (no HF/vLLM deps) |
 
45
  | `kernels/fused_ops.cu` | 8.5 KB | Fused CUDA kernels (JIT compiled, SM87) |
46
 
47
  ## Quantization details
48
 
49
- - **Method**: RTN (Round-To-Nearest) with INT4 packing in GPTQ format
50
- - **Bits**: 4-bit (decoder linear layers), fp16 (audio encoder, norms, embeddings)
51
  - **Group size**: 128
52
- - **Encoding**: uint4b8 (value + 8 bias), compatible with Marlin fused INT4 kernel
53
- - **Why RTN over GPTQ**: GPTQ's Hessian optimization destroys the critical SPAD-to-text transition boundary in Voxtral's streaming architecture. RTN preserves it perfectly. See the [full quantization report](https://huggingface.co/Teaspoon-AI/Voxtral-Mini-4B-INT4-Jetson/blob/main/README.md#why-rtn-not-gptq) below.
54
 
55
  ## Architecture
56
 
57
  | Component | Params | Precision | Size |
58
  |-----------|--------|-----------|------|
59
- | Audio encoder (Whisper-style, 32 layers) | ~600M | fp16 | 1.86 GB |
60
- | Projector (5120 → 3072 → 3072) | ~25M | fp16 | 0.05 GB |
61
- | LM decoder (26 layers, 3072 hidden, GQA 32/8 heads) | ~3B | INT4 | ~2.2 GB |
62
- | ada_rms_norm_t_cond (52 tensors) | ~1M | fp16 | 0.01 GB |
63
- | **Total** | **~3.6B** | | **4.1 GB** |
 
64
 
65
  ## Transcription quality
66
 
@@ -81,7 +96,7 @@ Tested on Fleurs en_us samples — near-perfect output matching the fp16 baseline
81
  No HuggingFace or vLLM dependencies needed. Runs inside the [PyTorch Jetson container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/l4t-pytorch).
82
 
83
  ```bash
84
- pip install safetensors websockets soundfile numpy librosa
85
 
86
  # Test with an audio file
87
  python scripts/jetson_serve_sdpa.py --test audio.wav
@@ -98,21 +113,6 @@ The server exposes `ws://localhost:8000/v1/realtime` for streaming transcription
98
  - Pre-allocated KV cache (eliminates per-token torch.cat)
99
  - Fused CUDA kernels for RMSNorm, RoPE, SiLU·Mul (~500 kernel launches/token → ~80)
100
 
101
- ### Option 2: vLLM serving
102
-
103
- ```bash
104
- pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly --pre
105
- pip install librosa soxr
106
-
107
- python -m vllm.entrypoints.openai.api_server \
108
- --model /path/to/this/repo \
109
- --tokenizer-mode mistral --config-format mistral --load-format mistral \
110
- --max-model-len 8192 --dtype float16 --enforce-eager \
111
- --gpu-memory-utilization 0.5
112
- ```
113
-
114
- **Note**: Requires vLLM nightly (>=0.15.2dev) for `/v1/realtime` WebSocket support.
115
-
116
  ### WebSocket client example
117
 
118
  ```python
@@ -165,4 +165,5 @@ GPTQ quantization fails on this model at every bit precision (4-bit and 8-bit) w
165
  ## Credits
166
 
167
  - Base model: [Voxtral Mini 4B Realtime](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) by Mistral AI
 
168
  - Quantization and Jetson optimization by [Teaspoon AI](https://huggingface.co/Teaspoon-AI)
 
8
  - mistral
9
  - int4
10
  - quantized
11
+ - marlin
12
  - jetson
13
  - edge
14
  - realtime
 
33
 
34
  INT4 quantized [Voxtral Mini 4B Realtime](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) for edge deployment on NVIDIA Jetson Orin Nano (8 GB).
35
 
36
+ **4.1 GB single file** — fits in 8 GB unified memory with room for KV cache and runtime.
37
 
38
  ## What's in this repo
39
 
40
  | File | Size | Description |
41
  |------|------|-------------|
42
+ | `consolidated.safetensors` | 4.1 GB | Marlin-packed INT4 decoder + BF16 encoder/norms/embeddings |
43
  | `params.json` | 1.6 KB | Model architecture config (Mistral native format) |
44
  | `tekken.json` | 15 MB | Mistral tekken tokenizer |
45
+ | `scripts/jetson_serve_sdpa.py` | ~50 KB | Self-contained inference server (no HF/vLLM deps) |
46
+ | `scripts/quantize_marlin.py` | ~6 KB | Quantization script to reproduce this model |
47
  | `kernels/fused_ops.cu` | 8.5 KB | Fused CUDA kernels (JIT compiled, SM87) |
48
 
49
  ## Quantization details
50
 
51
+ - **Method**: RTN (Round-To-Nearest) quantized directly into Marlin-packed format
52
+ - **Bits**: 4-bit (decoder linear layers), BF16 (audio encoder, norms, embeddings)
53
  - **Group size**: 128
54
+ - **Encoding**: uint4b8 (value + 8 bias), Marlin tiled INT4 layout
55
+ - **Single step**: BF16 → RTN quantize → Marlin pack (no intermediate GPTQ format, scales computed once)
56
+ - **Why RTN over GPTQ**: GPTQ's Hessian optimization destroys the critical SPAD-to-text transition boundary in Voxtral's streaming architecture. RTN preserves it perfectly. See [below](#why-rtn-not-gptq).
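As a minimal sketch of the uint4b8 round-trip described above (plain PyTorch, illustrative only — not the actual packing code in this repo):

```python
import torch

GROUP_SIZE = 128
BIAS = 8  # uint4b8: stored nibble = int4 value + 8

def rtn_group(w):
    """RTN-quantize one group to uint4b8; returns (stored nibbles, scale)."""
    scale = w.abs().max().clamp(min=1e-10) / BIAS
    q = torch.round(w / scale).clamp(-BIAS, BIAS - 1)  # int4 range [-8, 7]
    return (q + BIAS).to(torch.uint8), scale           # stored in [0, 15]

def dequant_group(stored, scale):
    return (stored.float() - BIAS) * scale

w = torch.randn(GROUP_SIZE)
stored, scale = rtn_group(w)
w_hat = dequant_group(stored, scale)
# Error is at most one quantization step: the positive extreme rounds to 8
# and clamps to 7, so the bound is `scale`, not `scale / 2`.
assert (w - w_hat).abs().max() <= scale + 1e-6
```

The `scale = max_abs / 8` choice matches the symmetric, per-group scheme in `quantize_marlin.py`.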
57
+
58
+ ### Reproducing the quantization
59
+
60
+ ```bash
61
+ pip install torch safetensors numpy
62
+
63
+ # From the original HuggingFace model:
64
+ python scripts/quantize_marlin.py \
65
+ --model-dir path/to/Voxtral-Mini-4B-Realtime-2602 \
66
+ --output-dir ./output
67
+ ```
68
 
69
  ## Architecture
70
 
71
  | Component | Params | Precision | Size |
72
  |-----------|--------|-----------|------|
73
+ | Audio encoder (Whisper-style, 32 layers) | ~600M | BF16 | 1.86 GB |
74
+ | Projector (5120 → 3072 → 3072) | ~25M | BF16 | 0.05 GB |
75
+ | LM decoder (26 layers, 3072 hidden, GQA 32/8 heads) | ~3B | Marlin INT4 | ~1.58 GB |
76
+ | Token embeddings (131072 Γ— 3072) | ~400M | BF16 | 0.77 GB |
77
+ | ada_rms_norm_t_cond + norms | ~1M | BF16 | 0.01 GB |
78
+ | **Total** | **~4B** | | **4.1 GB** |
79
 
80
  ## Transcription quality
81
 
 
96
  No HuggingFace or vLLM dependencies needed. Runs inside the [PyTorch Jetson container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/l4t-pytorch).
97
 
98
  ```bash
99
+ pip install safetensors websockets soundfile numpy librosa marlin
100
 
101
  # Test with an audio file
102
  python scripts/jetson_serve_sdpa.py --test audio.wav
 
113
  - Pre-allocated KV cache (eliminates per-token torch.cat)
114
  - Fused CUDA kernels for RMSNorm, RoPE, SiLU·Mul (~500 kernel launches/token → ~80)
115
 
116
  ### WebSocket client example
117
 
118
  ```python
 
165
  ## Credits
166
 
167
  - Base model: [Voxtral Mini 4B Realtime](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) by Mistral AI
168
+ - Marlin INT4 kernel: [IST-DASLab/marlin](https://github.com/IST-DASLab/marlin) (Apache 2.0)
169
  - Quantization and Jetson optimization by [Teaspoon AI](https://huggingface.co/Teaspoon-AI)
consolidated.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:fd25f9d675042c37b0a9b051a5333ef001c129d302461995bdc3e7b321c3b2b6
3
- size 4382321392
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a88ff5bf15ec7ca11fc7b0ff51148721dcca585f7c356baa2576eee785250d44
3
+ size 4367478888
params.json CHANGED
@@ -54,12 +54,12 @@
54
  "ada_rms_norm_t_cond": true,
55
  "ada_rms_norm_t_cond_dim": 32,
56
  "quantization_config": {
57
- "quant_method": "gptq",
58
  "bits": 4,
59
  "group_size": 128,
60
- "desc_act": false,
61
  "sym": true,
62
- "checkpoint_format": "gptq",
63
- "pack_dtype": "int32"
 
64
  }
65
  }
 
54
  "ada_rms_norm_t_cond": true,
55
  "ada_rms_norm_t_cond_dim": 32,
56
  "quantization_config": {
57
+ "quant_method": "rtn",
58
  "bits": 4,
59
  "group_size": 128,
 
60
  "sym": true,
61
+ "checkpoint_format": "marlin",
62
+ "pack_dtype": "int32",
63
+ "encoding": "uint4b8"
64
  }
65
  }
scripts/jetson_serve_sdpa.py CHANGED
@@ -1,7 +1,7 @@
1
  #!/usr/bin/env python3
2
- """Voxtral Mini 4B Realtime β€” Jetson Orin Nano 8GB inference server.
3
 
4
- Loads INT4-packed GPTQ weights from Mistral native format and serves
5
  transcription via WebSocket at ws://localhost:8000/v1/realtime.
6
 
7
  Key architecture detail: at each decoder position, the input embedding is
@@ -55,6 +55,7 @@ try:
55
  HAS_MARLIN = True
56
  except ImportError:
57
  HAS_MARLIN = False
 
58
 
59
  # Try to JIT-compile fused CUDA kernels (collapses ~500 kernel launches/token to ~80)
60
  HAS_FUSED = False
@@ -103,54 +104,24 @@ DOWNSAMPLE_FACTOR = 4
103
 
104
  # ─── Marlin Fused INT4 Linear ────────────────────────────────────────────────
105
 
106
- class MarlinLinear(nn.Module):
107
- """Linear layer using Marlin fused INT4 dequant+matmul CUDA kernel.
108
 
109
- Repacks GPTQ INT4 weights into Marlin's optimized format at construction time.
110
- Forward pass is a single fused kernel call — ~50x faster than on-the-fly dequant.
111
- Memory footprint is identical to GPTQ INT4 (no extra memory needed).
112
  """
113
 
114
- def __init__(self, qweight, scales, qzeros, unpermute=None):
115
  super().__init__()
116
- in_features = qweight.shape[0] * PACK_FACTOR
117
- out_features = qweight.shape[1]
118
- n_groups = scales.shape[0]
119
- self.in_features = in_features
120
- self.out_features = out_features
121
-
122
- # Dequantize GPTQ β†’ fp16, then repack into Marlin format
123
- shifts = torch.arange(0, 32, BITS, device=qweight.device, dtype=torch.int32)
124
- unpacked = (qweight.unsqueeze(0) >> shifts.view(-1, 1, 1)) & 0xF
125
- unpacked = unpacked.permute(1, 0, 2).reshape(in_features, out_features)
126
- unpacked = unpacked.T.reshape(out_features, n_groups, GROUP_SIZE)
127
- s = scales.T.float().unsqueeze(-1)
128
- w_fp16 = ((unpacked.float() - BIAS) * s).reshape(out_features, in_features).half()
129
- del unpacked, s
130
-
131
- if unpermute is not None:
132
- n_heads, hidden_size = unpermute
133
- head_dim = w_fp16.shape[0] // n_heads
134
- w_fp16 = (w_fp16.view(n_heads, 2, head_dim // 2, hidden_size)
135
- .transpose(1, 2)
136
- .reshape(out_features, in_features))
137
-
138
- # Create temporary nn.Linear for Marlin's pack()
139
- linear = nn.Linear(in_features, out_features, bias=False,
140
- dtype=torch.half, device=qweight.device)
141
- linear.weight.data = w_fp16
142
-
143
- # Create Marlin layer and pack (handles permutation + bit packing)
144
- ml = _marlin.Layer(in_features, out_features, groupsize=GROUP_SIZE)
145
- ml.pack(linear, scales.T)
146
- del linear, w_fp16
147
-
148
- # Store Marlin buffers
149
- self.register_buffer('B', ml.B.to(qweight.device))
150
- self.register_buffer('s', ml.s.to(qweight.device))
151
  self.register_buffer('workspace',
152
- torch.zeros(out_features // 128 * 16,
153
- dtype=torch.int, device=qweight.device),
154
  persistent=False)
155
 
156
  def forward(self, x):
@@ -161,85 +132,6 @@ class MarlinLinear(nn.Module):
161
  return C
162
 
163
 
164
- # ─── GPTQ INT4 Dequantization (fallback when Marlin unavailable) ────────────
165
-
166
- class DequantLinear(nn.Module):
167
- """Linear layer with INT4 GPTQ packed weights.
168
-
169
- Supports two modes:
170
- - On-the-fly dequantization (default): dequantizes each forward call
171
- - Cached mode: stores pre-dequantized fp16 weight for fast matmul
172
- """
173
-
174
- _shifts = None # class-level cached shifts tensor
175
-
176
- def __init__(self, qweight, scales, qzeros, unpermute=None):
177
- super().__init__()
178
- self.register_buffer('qweight', qweight)
179
- self.register_buffer('scales', scales)
180
- self.register_buffer('qzeros', qzeros)
181
- self.in_features = qweight.shape[0] * PACK_FACTOR
182
- self.out_features = qweight.shape[1]
183
- self.unpermute = unpermute
184
- self._cached_w = None # pre-dequantized fp16 weight [out, in]
185
-
186
- def cache_weight(self, free_int4=True):
187
- """Pre-dequantize and cache the fp16 weight.
188
- If free_int4=True, frees INT4 buffers (saves memory, not reversible).
189
- """
190
- self._cached_w = self._dequantize()
191
- if free_int4:
192
- self.qweight = None
193
- self.scales = None
194
- self.qzeros = None
195
-
196
- def uncache_weight(self):
197
- """Free the cached weight (e.g., before re-loading INT4 weights)."""
198
- self._cached_w = None
199
-
200
- @property
201
- def cached_bytes(self):
202
- """Memory used by cached weight in bytes."""
203
- if self._cached_w is not None:
204
- return self._cached_w.nelement() * self._cached_w.element_size()
205
- return 0
206
-
207
- def _dequantize(self):
208
- """Dequantize INT4 packed weights to fp16 [out, in]."""
209
- qw = self.qweight
210
- in_packed, out = qw.shape
211
- n_groups = self.scales.shape[0]
212
-
213
- # Cached shifts tensor (shared across all instances)
214
- if DequantLinear._shifts is None or DequantLinear._shifts.device != qw.device:
215
- DequantLinear._shifts = torch.arange(0, 32, BITS, device=qw.device, dtype=torch.int32)
216
- shifts = DequantLinear._shifts
217
-
218
- # Vectorized unpack: [8, in/8, out]
219
- unpacked = (qw.unsqueeze(0) >> shifts.view(-1, 1, 1)) & 0xF
220
- # Interleave to [in, out] then transpose+group to [out, groups, GROUP_SIZE]
221
- unpacked = unpacked.permute(1, 0, 2).reshape(self.in_features, out)
222
- unpacked = unpacked.T.reshape(out, n_groups, GROUP_SIZE)
223
- # Scale: (val - 8) * scale
224
- s = self.scales.T.float().unsqueeze(-1)
225
- w = ((unpacked.float() - BIAS) * s).reshape(out, self.in_features).half()
226
- del unpacked, s
227
-
228
- if self.unpermute is not None:
229
- n_heads, hidden_size = self.unpermute
230
- head_dim = w.shape[0] // n_heads
231
- w = (w.view(n_heads, 2, head_dim // 2, hidden_size)
232
- .transpose(1, 2)
233
- .reshape(out, self.in_features))
234
- return w
235
-
236
- def forward(self, x):
237
- if self._cached_w is not None:
238
- return F.linear(x, self._cached_w)
239
- w = self._dequantize()
240
- result = F.linear(x, w)
241
- del w
242
- return result
243
 
244
 
245
  # ─── Building Blocks ─────────────────────────────────────────────────────────
@@ -629,14 +521,13 @@ class VoxtralModel:
629
  return F.linear(h, self.embed.weight)
630
 
631
  def _dql(self, f, prefix, dev, unpermute=None):
632
- qw = f.get_tensor(f'{prefix}.qweight').to(dev)
633
- sc = f.get_tensor(f'{prefix}.scales').to(dev)
634
- qz = f.get_tensor(f'{prefix}.qzeros').to(dev)
635
- in_f = qw.shape[0] * PACK_FACTOR
636
- out_f = qw.shape[1]
637
- if HAS_MARLIN and in_f % 128 == 0 and out_f % 256 == 0:
638
- return MarlinLinear(qw, sc, qz, unpermute=unpermute)
639
- return DequantLinear(qw, sc, qz, unpermute=unpermute)
640
 
641
  def _set(self, module, name, tensor):
642
  """Replace a meta parameter with a real CUDA tensor."""
@@ -761,8 +652,7 @@ class VoxtralModel:
761
  print(f" LM layers {start}-{end-1} loaded")
762
  self._load_section(path, load_dec_batch)
763
 
764
- backend = "Marlin fused INT4" if HAS_MARLIN else "DequantLinear"
765
- print(f" LM decoder loaded ({self.n_layers} layers, {backend})")
766
  gc.collect()
767
  torch.cuda.empty_cache()
768
  mem = torch.cuda.memory_allocated() / 1024**3
@@ -785,128 +675,6 @@ class VoxtralModel:
785
  self.tokenizer = None
786
  print(" WARNING: mistral_common not available, using fallback decoder")
787
 
788
- def _pre_dequant(self):
789
- """Offload encoder to CPU and pre-dequantize decoder weights into GPU cache.
790
-
791
- After encoding is done, the encoder (~1.86 GB) is no longer needed on GPU.
792
- Offloading it frees memory for caching pre-dequantized decoder weights,
793
- which eliminates the per-token dequantization overhead.
794
- """
795
- import gc
796
-
797
- if hasattr(self, '_decoder_cached') and self._decoder_cached:
798
- return # already cached
799
-
800
- t0 = time.time()
801
-
802
- # Move encoder + projector to CPU to free GPU memory
803
- self.encoder.cpu()
804
- self.projector.cpu()
805
- gc.collect()
806
- torch.cuda.empty_cache()
807
- self._evict_cache()
808
-
809
- free, _ = torch.cuda.mem_get_info(0)
810
- print(f" After encoder offload: {free/1024**3:.2f} GB free")
811
-
812
- # Budget: leave 500 MB for KV cache + intermediates
813
- budget = free - 500 * 1024 * 1024
814
- used_bytes = 0
815
- cached_count = 0
816
-
817
- for i, dl in enumerate(self.layers):
818
- projs = [dl.attn.q_proj, dl.attn.k_proj, dl.attn.v_proj, dl.attn.o_proj,
819
- dl.gate_proj, dl.up_proj, dl.down_proj]
820
-
821
- # Estimate net memory cost (fp16 weight minus freed INT4 buffers)
822
- layer_fp16 = sum(
823
- p.in_features * p.out_features * 2
824
- for p in projs if isinstance(p, DequantLinear) and p._cached_w is None
825
- )
826
- layer_int4 = sum(
827
- p.qweight.nelement() * 4 + p.scales.nelement() * 2 + p.qzeros.nelement() * 4
828
- for p in projs
829
- if isinstance(p, DequantLinear) and p.qweight is not None
830
- )
831
- net = layer_fp16 - layer_int4 # net increase in memory
832
-
833
- if used_bytes + net > budget:
834
- break
835
-
836
- for p in projs:
837
- if isinstance(p, DequantLinear) and p._cached_w is None and p.qweight is not None:
838
- p.cache_weight(free_int4=True)
839
-
840
- used_bytes += net
841
- cached_count += 1
842
- # Periodic cleanup to keep peak memory low
843
- if cached_count % 5 == 0:
844
- gc.collect()
845
- torch.cuda.empty_cache()
846
-
847
- gc.collect()
848
- torch.cuda.empty_cache()
849
- free2, _ = torch.cuda.mem_get_info(0)
850
- print(f" Pre-dequantized {cached_count}/{self.n_layers} layers in {time.time()-t0:.1f}s, "
851
- f"{free2/1024**3:.2f} GB free")
852
- self._decoder_cached = True
853
-
854
- def _restore_encoder(self):
855
- """Move encoder back to GPU for the next transcription.
856
-
857
- Frees cached decoder weights first to make room, then reloads
858
- INT4 weights for layers that had their buffers freed.
859
- """
860
- import gc
861
-
862
- if not hasattr(self, '_decoder_cached') or not self._decoder_cached:
863
- return
864
-
865
- t0 = time.time()
866
-
867
- # Free cached decoder weights
868
- needs_reload = []
869
- for i, dl in enumerate(self.layers):
870
- for p in [dl.attn.q_proj, dl.attn.k_proj, dl.attn.v_proj, dl.attn.o_proj,
871
- dl.gate_proj, dl.up_proj, dl.down_proj]:
872
- if isinstance(p, DequantLinear):
873
- if p._cached_w is not None and p.qweight is None:
874
- needs_reload.append(i)
875
- p.uncache_weight()
876
-
877
- gc.collect()
878
- torch.cuda.empty_cache()
879
- self._evict_cache()
880
-
881
- # Move encoder + projector back to GPU
882
- self.encoder.to(self.device)
883
- self.projector.to(self.device)
884
-
885
- # Reload INT4 weights for layers that were freed
886
- if needs_reload:
887
- needs_reload = sorted(set(needs_reload))
888
- path = os.path.join(self.model_path, 'consolidated.safetensors')
889
- D = str(self.device)
890
- with safe_open(path, framework='pt', device=D) as f:
891
- for i in needs_reload:
892
- lp = f'layers.{i}'
893
- dl = self.layers[i]
894
- dl.attn.q_proj = self._dql(f, f'{lp}.self_attn.q_proj', D)
895
- dl.attn.k_proj = self._dql(f, f'{lp}.self_attn.k_proj', D)
896
- dl.attn.v_proj = self._dql(f, f'{lp}.self_attn.v_proj', D)
897
- dl.attn.o_proj = self._dql(f, f'{lp}.self_attn.o_proj', D)
898
- dl.gate_proj = self._dql(f, f'{lp}.mlp.gate_proj', D)
899
- dl.up_proj = self._dql(f, f'{lp}.mlp.up_proj', D)
900
- dl.down_proj = self._dql(f, f'{lp}.mlp.down_proj', D)
901
- gc.collect()
902
- torch.cuda.empty_cache()
903
- print(f" Reloaded {len(needs_reload)} decoder layers from disk")
904
-
905
- gc.collect()
906
- torch.cuda.empty_cache()
907
- self._decoder_cached = False
908
- print(f" Encoder restored in {time.time()-t0:.1f}s")
909
-
910
  def decode_tokens(self, ids):
911
  if self.tokenizer is not None:
912
  try:
@@ -950,10 +718,6 @@ class VoxtralModel:
950
  free, _ = torch.cuda.mem_get_info(0)
951
  print(f" CUDA free before inference: {free/1024**3:.2f} GB")
952
 
953
- # Restore encoder to GPU if it was offloaded (only needed without Marlin)
954
- if not HAS_MARLIN:
955
- self._restore_encoder()
956
-
957
  # 0. Pad audio for streaming alignment
958
  audio = self._pad_audio(audio)
959
  print(f" padded audio: {len(audio)} samples ({len(audio)/SAMPLE_RATE:.1f}s)")
@@ -985,11 +749,7 @@ class VoxtralModel:
985
  del enc_ds
986
  print(f" adapter: {adapter.shape}")
987
 
988
- # 5. Offload encoder, pre-dequantize decoder weights (only without Marlin)
989
- if not HAS_MARLIN:
990
- self._pre_dequant()
991
-
992
- # 6. Build prompt: [BOS] + [SPAD] * (n_left_pad + delay_tokens)
993
  prompt_len = 1 + self.n_left_pad + self.delay_tokens
994
  prompt_ids = [TOKEN_BOS] + [TOKEN_STREAMING_PAD] * (self.n_left_pad + self.delay_tokens)
995
 
 
1
  #!/usr/bin/env python3
2
+ """Voxtral Mini 4B Realtime β€” Jetson Orin Nano inference server.
3
 
4
+ Loads Marlin-packed INT4 weights from consolidated.safetensors and serves
5
  transcription via WebSocket at ws://localhost:8000/v1/realtime.
6
 
7
  Key architecture detail: at each decoder position, the input embedding is
 
55
  HAS_MARLIN = True
56
  except ImportError:
57
  HAS_MARLIN = False
58
+ print("WARNING: Marlin not installed. Install with: pip install marlin")
59
 
60
  # Try to JIT-compile fused CUDA kernels (collapses ~500 kernel launches/token to ~80)
61
  HAS_FUSED = False
 
104
 
105
  # ─── Marlin Fused INT4 Linear ────────────────────────────────────────────────
106
 
107
+ class PrepackedMarlinLinear(nn.Module):
108
+ """Linear layer using pre-packed Marlin INT4 weights from safetensors.
109
 
110
+ Loads .B and .s tensors directly — no GPTQ→Marlin conversion needed.
111
+ Used with single-file consolidated.safetensors that already contains
112
+ Marlin-format weights.
113
  """
114
 
115
+ def __init__(self, B, s):
116
  super().__init__()
117
+ # B: [K//16, 2*N] int32, s: [K//groupsize, N] fp16
118
+ self.in_features = B.shape[0] * 16
119
+ self.out_features = B.shape[1] // 2
120
+ self.register_buffer('B', B)
121
+ self.register_buffer('s', s)
 
122
  self.register_buffer('workspace',
123
+ torch.zeros(self.out_features // 128 * 16,
124
+ dtype=torch.int, device=B.device),
125
  persistent=False)
126
 
127
  def forward(self, x):
 
132
  return C
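The buffer-shape arithmetic in `__init__` above can be checked standalone. K and N here are hypothetical layer dimensions, not values taken from the repo:

```python
import torch

K, N, GROUP_SIZE = 3072, 3072, 128  # hypothetical in/out features

# Marlin stores a K -> N 4-bit linear as 16x16 tiles, 8 nibbles per int32:
B = torch.zeros(K // 16, 2 * N, dtype=torch.int32)
s = torch.zeros(K // GROUP_SIZE, N, dtype=torch.half)

in_features = B.shape[0] * 16   # recovers K, as in PrepackedMarlinLinear
out_features = B.shape[1] // 2  # recovers N

assert in_features == K and out_features == N
assert B.numel() * 8 == K * N   # one int4 value per weight element
```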
133
 
134
 
135
 
136
 
137
  # ─── Building Blocks ─────────────────────────────────────────────────────────
 
521
  return F.linear(h, self.embed.weight)
522
 
523
  def _dql(self, f, prefix, dev, unpermute=None):
524
+ B = f.get_tensor(f'{prefix}.B').to(dev)
525
+ s = f.get_tensor(f'{prefix}.s').to(dev)
526
+ if not HAS_MARLIN:
527
+ raise RuntimeError(
528
+ "Marlin INT4 kernel required but not installed. "
529
+ "Install with: pip install marlin")
530
+ return PrepackedMarlinLinear(B, s)
 
531
 
532
  def _set(self, module, name, tensor):
533
  """Replace a meta parameter with a real CUDA tensor."""
 
652
  print(f" LM layers {start}-{end-1} loaded")
653
  self._load_section(path, load_dec_batch)
654
 
655
+ print(f" LM decoder loaded ({self.n_layers} layers, Marlin fused INT4)")
 
656
  gc.collect()
657
  torch.cuda.empty_cache()
658
  mem = torch.cuda.memory_allocated() / 1024**3
 
675
  self.tokenizer = None
676
  print(" WARNING: mistral_common not available, using fallback decoder")
677
 
678
  def decode_tokens(self, ids):
679
  if self.tokenizer is not None:
680
  try:
 
718
  free, _ = torch.cuda.mem_get_info(0)
719
  print(f" CUDA free before inference: {free/1024**3:.2f} GB")
720
 
721
  # 0. Pad audio for streaming alignment
722
  audio = self._pad_audio(audio)
723
  print(f" padded audio: {len(audio)} samples ({len(audio)/SAMPLE_RATE:.1f}s)")
 
749
  del enc_ds
750
  print(f" adapter: {adapter.shape}")
751
 
752
+ # 5. Build prompt: [BOS] + [SPAD] * (n_left_pad + delay_tokens)
 
753
  prompt_len = 1 + self.n_left_pad + self.delay_tokens
754
  prompt_ids = [TOKEN_BOS] + [TOKEN_STREAMING_PAD] * (self.n_left_pad + self.delay_tokens)
755
 
scripts/quantize_marlin.py ADDED
@@ -0,0 +1,266 @@
1
+ #!/usr/bin/env python3
2
+ """Single-step BF16 β†’ Marlin INT4 quantization for Voxtral Realtime 4B.
3
+
4
+ Produces a single consolidated.safetensors with:
5
+ - Encoder + adapter + tok_embeddings + norms: BF16 (copied as-is)
6
+ - Decoder linear weights: Marlin-packed INT4 (group_size=128)
7
+
8
+ The decoder linears are RTN-quantized (round-to-nearest, symmetric, per-group)
9
+ and packed directly into Marlin's tiled INT4 format in one step — no intermediate
10
+ GPTQ format, no multiple requantization cycles.
11
+
12
+ Why RTN over GPTQ: GPTQ's Hessian optimization destroys the critical SPAD-to-text
13
+ transition boundary in Voxtral's streaming architecture because calibration runs
14
+ through MistralForCausalLM (without ada_rms_norm_t_cond). RTN preserves it.
15
+
16
+ Marlin pack logic from IST-DASLab/marlin (Apache 2.0):
17
+ https://github.com/IST-DASLab/marlin
18
+
19
+ Usage:
20
+ # From original HuggingFace BF16 model:
21
+ python3 quantize_marlin.py --model-dir path/to/Voxtral-Mini-4B-Realtime-2602
22
+
23
+ # Output (default: ./output/consolidated.safetensors):
24
+ python3 quantize_marlin.py --model-dir path/to/model --output-dir ./my-output
25
+
26
+ Requires: torch, numpy, safetensors
27
+ """
28
+
29
+ import argparse
30
+ import gc
31
+ import json
32
+ import os
33
+ import shutil
34
+ import sys
35
+ import time
36
+
37
+ import numpy as np
38
+ import torch
39
+ from safetensors import safe_open
40
+ from safetensors.torch import save_file
41
+
42
+
43
+ # ─── Model constants ─────────────────────────────────────────────────────────
44
+
45
+ N_LAYERS = 26
46
+ N_HEADS = 32
47
+ N_KV_HEADS = 8
48
+ DIM = 3072
49
+ HEAD_DIM = 128
50
+
51
+ # ─── Quantization constants ──────────────────────────────────────────────────
52
+
53
+ BITS = 4
54
+ GROUP_SIZE = 128
55
+ PACK_FACTOR = 32 // BITS # 8 int4 values per int32
56
+ BIAS = 1 << (BITS - 1) # 8 (uint4b8 encoding: stored = value + 8)
57
+ MAXQ = (1 << BITS) - 1 # 15
58
+
59
+ # ─── Mistral → HF naming for decoder linears ─────────────────────────────────
60
+
61
+ DECODER_LINEARS = {
62
+ "attention.wq": ("self_attn.q_proj", True, N_HEADS), # needs Q/K permute
63
+ "attention.wk": ("self_attn.k_proj", True, N_KV_HEADS), # needs Q/K permute
64
+ "attention.wv": ("self_attn.v_proj", False, None),
65
+ "attention.wo": ("self_attn.o_proj", False, None),
66
+ "feed_forward.w1": ("mlp.gate_proj", False, None),
67
+ "feed_forward.w2": ("mlp.down_proj", False, None),
68
+ "feed_forward.w3": ("mlp.up_proj", False, None),
69
+ }
70
+
71
+
72
+ # ─── Marlin permutation tables (from IST-DASLab/marlin, Apache 2.0) ─────────
73
+
74
+ def _get_perms():
75
+ perm = []
76
+ for i in range(32):
77
+ perm1 = []
78
+ col = i // 4
79
+ for block in [0, 1]:
80
+ for row in [
81
+ 2 * (i % 4),
82
+ 2 * (i % 4) + 1,
83
+ 2 * (i % 4 + 4),
84
+ 2 * (i % 4 + 4) + 1,
85
+ ]:
86
+ perm1.append(16 * row + col + 8 * block)
87
+ for j in range(4):
88
+ perm.extend([p + 256 * j for p in perm1])
89
+
90
+ perm = np.array(perm)
91
+ interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
92
+ perm = perm.reshape((-1, 8))[:, interleave].ravel()
93
+ perm = torch.from_numpy(perm)
94
+
95
+ scale_perm = []
96
+ for i in range(8):
97
+ scale_perm.extend([i + 8 * j for j in range(8)])
98
+
99
+ return perm, scale_perm
100
+
101
+
102
+ _perm, _scale_perm = _get_perms()
103
+
104
+
105
+ # ─── Q/K head permutation (Mistral → HF interleaving) ────────────────────────
106
+
107
+ def permute_qk(w, n_heads, hidden_size):
108
+ """Apply Mistral→HF head dimension interleaving for Q/K weights."""
109
+ head_dim = w.shape[0] // n_heads
110
+ return (
111
+ w.view(n_heads, head_dim // 2, 2, hidden_size)
112
+ .transpose(1, 2)
113
+ .reshape(n_heads * head_dim, hidden_size)
114
+ )
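A toy check of the interleave (single head, head_dim 4, hidden 2; values are arbitrary and chosen only to make the row reordering visible):

```python
import torch

def permute_qk(w, n_heads, hidden_size):
    """Same Mistral -> HF interleave as above, reproduced for a standalone test."""
    head_dim = w.shape[0] // n_heads
    return (w.view(n_heads, head_dim // 2, 2, hidden_size)
             .transpose(1, 2)
             .reshape(n_heads * head_dim, hidden_size))

w = torch.arange(8.0).reshape(4, 2)      # rows 0..3 of a tiny Q weight
out = permute_qk(w, n_heads=1, hidden_size=2)
# HF layout groups the first halves of each rotary pair first: rows 0,2,1,3
assert out[:, 0].tolist() == [0.0, 4.0, 2.0, 6.0]
```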
115
+
116
+
117
+ # ─── Single-step RTN quantize + Marlin pack ──────────────────────────────────
118
+
119
+ def quantize_and_pack_marlin(w_bf16, group_size=GROUP_SIZE):
120
+ """RTN-quantize a BF16 weight and pack into Marlin format in one step.
121
+
122
+ Args:
123
+ w_bf16: [N_out, K] BF16/FP16 weight tensor
124
+
125
+ Returns:
126
+ B: [K//16, 2*N_out] int32 (Marlin-packed weights)
127
+ s: [K//group_size, N_out] fp16 (Marlin-permuted scales)
128
+ """
129
+ N_out, K = w_bf16.shape
130
+ n_groups = K // group_size
131
+ tile = 16
132
+
133
+ # ── Step 1: Compute per-group RTN scales ──
134
+ # Work in [K, N] layout for Marlin packing
135
+ w = w_bf16.t().float().contiguous() # [K, N]
136
+ w_grouped = w.reshape(n_groups, group_size, N_out)
137
+ max_val = w_grouped.abs().amax(dim=1).clamp(min=1e-10) # [n_groups, N]
138
+ scales = (max_val / BIAS).half() # [n_groups, N] — scale = max_abs / 8
139
+
140
+ # ── Step 2: Quantize to uint4 ──
141
+ s_expanded = scales.float().unsqueeze(1).expand_as(w_grouped) # [n_groups, gs, N]
142
+ w_int = torch.round(w_grouped / s_expanded).clamp(-BIAS, BIAS - 1).int()
143
+ w_uint = (w_int + BIAS).clamp(0, MAXQ) # uint4b8: [-8,7] → [0,15]
144
+ w_uint = w_uint.reshape(K, N_out) # [K, N]
145
+
146
+ # ── Step 3: Permute scales for Marlin ──
147
+ s = scales.clone() # [n_groups, N]
148
+ s = s.reshape((-1, len(_scale_perm)))[:, _scale_perm]
149
+ s = s.reshape((-1, N_out)).contiguous()
150
+
151
+ # ── Step 4: Tile into 16Γ—16 blocks ──
152
+ w_tiled = w_uint.reshape(K // tile, tile, N_out // tile, tile)
153
+ w_tiled = w_tiled.permute(0, 2, 1, 3)
154
+ w_tiled = w_tiled.reshape(K // tile, N_out * tile)
155
+
156
+ # ── Step 5: Apply Marlin permutation ──
157
+ res = w_tiled.reshape((-1, _perm.numel()))[:, _perm].reshape(w_tiled.shape)
158
+
159
+ # ── Step 6: Pack 8 int4 values into each int32 ──
160
+ q = np.zeros((res.shape[0], res.shape[1] // 8), dtype=np.uint32)
161
+ res_np = res.cpu().numpy().astype(np.uint32)
162
+ for i in range(8):
163
+ q |= res_np[:, i::8] << (4 * i)
164
+ B = torch.from_numpy(q.astype(np.int32))
165
+
166
+ return B, s.half()
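The Step 6 packing loop can be sanity-checked by inverting it. This is standalone NumPy with random data standing in for the permuted tiles:

```python
import numpy as np

rng = np.random.default_rng(0)
res = rng.integers(0, 16, size=(4, 64), dtype=np.uint32)  # stand-in tiles

# Pack: nibble i of each int32 takes column slice i::8 (same loop as Step 6)
q = np.zeros((res.shape[0], res.shape[1] // 8), dtype=np.uint32)
for i in range(8):
    q |= res[:, i::8] << (4 * i)

# Unpack by reversing the shifts
out = np.zeros_like(res)
for i in range(8):
    out[:, i::8] = (q >> (4 * i)) & 0xF

assert (out == res).all()
```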
167
+
168
+
169
+ # ─── Main ────────────────────────────────────────────────────────────────────
170
+
171
+ def main():
172
+ parser = argparse.ArgumentParser(
173
+ description="Quantize Voxtral BF16 → single-file Marlin INT4")
174
+ parser.add_argument("--model-dir", required=True,
175
+ help="Directory with consolidated.safetensors (BF16, Mistral format)")
176
+ parser.add_argument("--output-dir", default="./output",
177
+ help="Output directory (default: ./output)")
178
+ args = parser.parse_args()
179
+
180
+ sf_path = os.path.join(args.model_dir, "consolidated.safetensors")
181
+ if not os.path.exists(sf_path):
182
+ print(f"Error: {sf_path} not found", file=sys.stderr)
183
+ sys.exit(1)
184
+
185
+ os.makedirs(args.output_dir, exist_ok=True)
186
+ output_path = os.path.join(args.output_dir, "consolidated.safetensors")
187
+
188
+ print(f"Input: {sf_path}")
189
+ print(f"Output: {output_path}")
190
+ print(f"Quantization: RTN {BITS}-bit, group_size={GROUP_SIZE}, uint4b8 Marlin")
191
+ print()
192
+
193
+ sf = safe_open(sf_path, framework="pt", device="cpu")
194
+ all_keys = list(sf.keys())
195
+ tensors = {}
196
+ t0 = time.time()
197
+
198
+ # ── Pass 1: Copy non-decoder-linear tensors as-is ──
199
+ # These are encoder, adapter, tok_embeddings, norms, ada_rms_norm, final norm
200
+ decoder_linear_keys = set()
201
+ for layer_idx in range(N_LAYERS):
202
+ for mistral_name in DECODER_LINEARS:
203
+ decoder_linear_keys.add(f"layers.{layer_idx}.{mistral_name}.weight")
204
+
205
+ n_copied = 0
206
+ for key in all_keys:
207
+ if key in decoder_linear_keys:
208
+ continue
209
+ tensors[key] = sf.get_tensor(key)
210
+ n_copied += 1
211
+
212
+ print(f"Copied {n_copied} non-linear tensors (encoder, norms, embeddings, etc.)")
213
+
214
+ # ── Pass 2: Quantize decoder linears → Marlin ──
215
+ n_quantized = 0
216
+ for layer_idx in range(N_LAYERS):
217
+ for mistral_name, (hf_name, needs_permute, n_heads) in DECODER_LINEARS.items():
218
+ src_key = f"layers.{layer_idx}.{mistral_name}.weight"
219
+ w = sf.get_tensor(src_key).half() # bf16 → fp16 for torch ops
220
+
221
+ # Apply Q/K head permutation if needed
222
+ if needs_permute:
223
+ w = permute_qk(w, n_heads, DIM)
224
+
225
+ # Single-step quantize + Marlin pack
226
+ B, s = quantize_and_pack_marlin(w)
227
+ del w
228
+
229
+ out_prefix = f"layers.{layer_idx}.{hf_name}"
230
+ tensors[f"{out_prefix}.B"] = B
231
+ tensors[f"{out_prefix}.s"] = s
232
+ n_quantized += 1
233
+
234
+ gc.collect()
235
+ elapsed = time.time() - t0
236
+ print(f" Layer {layer_idx + 1}/{N_LAYERS} quantized ({elapsed:.1f}s)")
237
+
238
+ print(f"\nQuantized {n_quantized} decoder linear weights to Marlin INT4")
239
+ print(f"Total tensors in output: {len(tensors)}")
240
+
241
+ # ── Save ──
242
+ print(f"\nSaving to {output_path}...")
243
+ save_file(tensors, output_path)
244
+ file_size = os.path.getsize(output_path)
245
+ print(f"Output: {file_size / (1024**3):.2f} GB ({len(tensors)} tensors)")
246
+
247
+ # ── Copy auxiliary files ──
248
+ for aux in ["params.json", "tekken.json"]:
249
+ src = os.path.join(args.model_dir, aux)
250
+ if os.path.exists(src):
251
+ shutil.copy2(src, os.path.join(args.output_dir, aux))
252
+ print(f"Copied {aux}")
253
+
254
+ print(f"\nDone in {time.time() - t0:.1f}s")
255
+
256
+ # ── Verify tensor names ──
257
+ print(f"\nSample Marlin tensor names:")
258
+ marlin_keys = sorted(k for k in tensors if k.endswith(".B"))[:5]
259
+ for k in marlin_keys:
260
+ print(f" {k}: {list(tensors[k].shape)} {tensors[k].dtype}")
261
+ sk = k[:-2] + ".s"
262
+ print(f" {sk}: {list(tensors[sk].shape)} {tensors[sk].dtype}")
263
+
264
+
265
+ if __name__ == "__main__":
266
+ main()