SLM750-Edge 1.58-bit — Q4_K_M Quantized GGUF

A compact, efficiently quantized BitNet b1.58 ternary model optimized for edge deployment. This repository provides a plug-and-play GGUF file ready for use with llama.cpp and its ecosystem (llama-cli, llama-cpp-python, text-generation-webui, and more).

⚠️ IMPORTANT (PLEASE READ): This model does NOT run with standard llama.cpp. It requires a patched bitnet.cpp (see below). Without the patch, loading fails with errors such as missing tensor 'blk.0.attn_sub_norm.weight'. The BitNet folder contains benchmark and automatic installation scripts, plus fixes for known problems when installing Python Torch and sentencepiece in python3.13/site-packages.


SLM750-Edge: Distillation + GGUF Build Colab

slm750_edge_gpu_package.tgz.gz Google Drive


Model Highlights

| Attribute | Value |
|-----------|-------|
| Architecture | BitNet b1.58 (ternary {-1, 0, +1} weights) |
| Parameters | ~1.4B |
| Embedding Dim | 1,536 |
| Attention Heads | 12 (4 KV heads, GQA) |
| Feed-Forward | 4,096 (ReLU² activation) |
| Context Length | 8,192 tokens |
| Vocabulary | 256,000 tokens (Gemma-2 tokenizer) |
| Quantization | Q4_K_M (5.27 BPW) |
| File Size | 873 MB |
| License | BitNet 1.58 |

File Description

| File | Description |
|------|-------------|
| quantized_q4km.gguf | Recommended. Q4_K_M quantized GGUF, quant_fixed_from_source_gguf (921 MB) |

Quick Start

Prerequisites

ik_llama: a fork of llama.cpp with better CPU and hybrid GPU/CPU performance, new SOTA quantization types, first-class BitNet support, better DeepSeek performance via MLA, FlashMLA, fused MoE operations and tensor overrides for hybrid GPU/CPU inference, row-interleaved quant packing, and more.

Building llama-cpp with BitNet Support

https://copilot.microsoft.com/shares/pages/kaesS7TdPccnaLsu4iQ9T
  • llama.cpp built from source with BitNet support (commit 52b3df002 or later), OR
  • llama-cpp-python v0.3.x+

Usage with llama-cli

# Basic text generation
./llama-cli -m quantized_q4km.gguf \
  -p "Explain quantum computing in simple terms" \
  -n 256 \
  -t 4 \
  --temp 0.7 \
  --top-p 0.9

# Chat mode
./llama-cli -m quantized_q4km.gguf \
  -p "You are a helpful assistant." \
  --chat-template gemma \
  -n 512 \
  -t 4

Usage with llama-cpp-python (Python)

from llama_cpp import Llama

llm = Llama(
    model_path="quantized_q4km.gguf",
    n_ctx=8192,
    n_threads=4,
    verbose=False,
)

output = llm(
    "What is the meaning of life?",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
    echo=False,
)

print(output["choices"][0]["text"])

Usage with text-generation-webui

  1. Place quantized_q4km.gguf in the models/ directory.
  2. Launch text-generation-webui with --model quantized_q4km.gguf.
  3. Select the model in the UI under the "Model" tab.

Usage with LangChain

from langchain_community.llms import LlamaCpp

llm = LlamaCpp(
    model_path="quantized_q4km.gguf",
    n_ctx=8192,
    n_threads=4,
    temperature=0.7,
    top_p=0.9,
    verbose=False,
)

response = llm.invoke("Write a short poem about AI.")
print(response)

Performance

# SLM750-Edge: A 1.58-Bit Hybrid Edge Transformer
## Architectural Mathematics & Training Protocol

### "Zero thinking time for tool calls. >120 tok/s on mobile hardware."

---

## 1. Design Philosophy

SLM750-Edge combines three revolutionary advances:

1. **Hybrid Gemma-2/SmolLM2 architecture** — interleaved local+global attention with Gemma-2's logit softcapping, optimized to 750M
2. **BitNet b1.58 native quantization** — all weights constrained to {−1, 0, +1} during training, eliminating FP multiplications at inference
3. **GBNF grammar ejection** — tool-calling structure is *baked into the token distribution* so no CoT or post-processing is needed

**Result:** A 750M model whose ternary weights fit in about 148 MB at 1.58-bit (750M × 1.58 bits ÷ 8 ≈ 148 MB, or roughly 141 MiB), runs entirely in CPU cache on modern phone SoCs, and produces guaranteed-JSON tool calls at wire speed.

---

## 2. Architecture Blueprint

### 2.1 Macro Architecture

    SLM750-Edge(
        token_embed: Vocab(256k) → dim=1536    # embedding table
        layers:      24 × HybridDecoderLayer   # 12 global + 12 local (interleaved)
        norm_final:  SubLN                     # pre-RMSNorm + activation quantization
        lm_head:     dim=1536 → Vocab(256k)    # tied weights (optional)
    )


### 2.2 Layer Configuration

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| `hidden_dim` | 1536 | Power-of-2 multiple for NEON/ARM dot-product efficiency |
| `intermediate_dim` | 4096 | ReLU² activation: 1536 × 8/3 = 4096 (SmolLM2 ratio) |
| `num_attention_heads` | 12 | 1536 ÷ 12 = 128 head_dim — optimal for FlashAttention-2 |
| `num_kv_heads` | 4 | GQA ratio 3:1, about 67% KV-cache reduction vs MHA (4 cached heads instead of 12) |
| `head_dim` | 128 | FlashAttention-2 tile alignment (128 × 128 warp) |
| `num_layers` | 24 | 12 global + 12 sliding window (interleaved) |
| `sliding_window_size` | 1024 | Gemma-2 default — covers ~1K tokens of local context |
| `vocab_size` | 256000 | SmolLM2 tokenizer + tool-token extensions |
| `max_seq_len` | 8192 | 8K context for complex multi-step tool orchestration |

**Total parameters:** 997,271,040 including the tied 256,000 × 1536 embedding table (604,055,040 excluding it)

### 2.3 Parameter Budget (Exact)

| Component | Shape | Parameters |
|-----------|-------|------------|
| Embedding | 256000 × 1536 | 393,216,000 |
| Per-layer Attention Q | 1536 × 1536 | 2,359,296 |
| Per-layer Attention K | 1536 × 512 (GQA 4 KV heads) | 786,432 |
| Per-layer Attention V | 1536 × 512 | 786,432 |
| Per-layer Attention O | 1536 × 1536 | 2,359,296 |
| Per-layer FFN Gate | 1536 × 4096 | 6,291,456 |
| Per-layer FFN Up | 1536 × 4096 | 6,291,456 |
| Per-layer FFN Down | 4096 × 1536 | 6,291,456 |
| Per-layer SubLN (2x) | 2 × 1536 | 3,072 |
| **Per-layer total** | - | **25,168,896** |
| **24 layers total** | - | **604,053,504** |
| Norm final | 1536 | 1,536 |
| LM Head (tied w/ embed) | - | 0 |
| **Grand total** | - | **997,271,040** ≈ 997M |
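
The budget can be re-derived mechanically from the shapes listed in the table. The sketch below simply multiplies out those shapes and sums them; the variable names are illustrative, and it assumes the LM head is tied to the embedding and therefore adds no parameters.

```python
# Shapes copied from the parameter-budget table; names are illustrative only.
HIDDEN, FFN, VOCAB, LAYERS = 1536, 4096, 256_000, 24
KV_DIM = 4 * 128  # 4 KV heads x head_dim 128

per_layer = {
    "attn_q": HIDDEN * HIDDEN,
    "attn_k": HIDDEN * KV_DIM,
    "attn_v": HIDDEN * KV_DIM,
    "attn_o": HIDDEN * HIDDEN,
    "ffn_gate": HIDDEN * FFN,
    "ffn_up": HIDDEN * FFN,
    "ffn_down": FFN * HIDDEN,
    "subln": 2 * HIDDEN,
}

per_layer_total = sum(per_layer.values())
embedding = VOCAB * HIDDEN
final_norm = HIDDEN
# Assumption: LM head shares the embedding matrix, so it contributes 0.
grand_total = embedding + LAYERS * per_layer_total + final_norm

print(f"per layer      : {per_layer_total:,}")
print(f"all layers     : {LAYERS * per_layer_total:,}")
print(f"with embedding : {grand_total:,}")
```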

---

## 3. The BitNet b1.58 Mathematical Core

### 3.1 Ternary Weight Quantization

Every linear weight tensor $W \in \mathbb{R}^{d \times k}$ is quantized during **both forward and backward passes**:

$$W_q = \text{round\_clip}\left(\frac{W}{\alpha}\right) \quad \text{where} \quad \alpha = \frac{1}{n}\sum_{i,j} |W_{ij}|$$

Here $n = d \cdot k$ is the number of entries of $W$, so $\alpha$ is the mean absolute weight (the absmean scale of BitNet b1.58).

$$\text{round\_clip}(x) =
\begin{cases}
+1, & x > 0.5 \\
0, & |x| \leq 0.5 \\
-1, & x < -0.5
\end{cases}$$

The **scale factor** $\alpha$ is computed per output channel (per row $j$ of $W$) and stored in FP16:

$$\alpha_j = \frac{1}{k}\sum_{i=1}^{k} |W_{ji}|$$

**Total storage:** $W_q$ carries $\log_2 3 \approx 1.58$ bits of information per parameter (2 bits when packed with a ternary encoding), and $\alpha$ is one FP16 value per row, a negligible overhead.
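
As a rough illustration of the quantizer described above, the following pure-Python sketch computes one mean-absolute-value scale per row and applies `round_clip`. The function names, the tiny example matrix, and the `1e-8` guard against all-zero rows are assumptions made for the example, not taken from `bitnet_linear.py`.

```python
def round_clip(x: float) -> int:
    """Map a scaled weight to the ternary set {-1, 0, +1}."""
    if x > 0.5:
        return 1
    if x < -0.5:
        return -1
    return 0


def quantize_rows(W):
    """Return (W_q, alphas) with one absmean scale per row (output channel)."""
    W_q, alphas = [], []
    for row in W:
        alpha = sum(abs(w) for w in row) / len(row)
        alpha = max(alpha, 1e-8)  # assumption: guard against an all-zero row
        alphas.append(alpha)
        W_q.append([round_clip(w / alpha) for w in row])
    return W_q, alphas


# Hypothetical 2 x 4 weight matrix, just to show the mapping.
W = [[0.9, -0.1, 0.4, -1.2],
     [0.02, 0.3, -0.7, 0.1]]
W_q, alphas = quantize_rows(W)
print(W_q)
print(alphas)
```

Small weights relative to the row's mean magnitude collapse to 0, while anything above half the scale snaps to ±1.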

### 3.2 Straight-Through Estimator (STE) for Gradient Flow

The round-clip step is non-differentiable. We use STE:

$$\frac{\partial \mathcal{L}}{\partial W} \approx \frac{\partial \mathcal{L}}{\partial W_q} \cdot \mathbf{1}_{|W| \leq \alpha}$$

This passes gradients through the quantization boundary while clamping outliers. The indicator function $\mathbf{1}_{|W| \leq \alpha}$ prevents weights from growing unboundedly.
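
The backward rule can be written out by hand without an autograd framework. This is a minimal sketch of the masked pass-through only; `ste_backward` and the example values are hypothetical, and a real training loop would express the same thing through its framework's custom-gradient mechanism.

```python
def ste_backward(W, grad_W_q, alpha):
    """Straight-through estimator: copy dL/dW_q to dL/dW where |W| <= alpha,
    and zero the gradient for weights outside that range."""
    return [
        [g if abs(w) <= alpha else 0.0 for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(W, grad_W_q)
    ]


# Hypothetical latent weights and upstream gradients.
W = [[0.2, -1.5], [0.6, 0.9]]
grad_W_q = [[0.1, 0.4], [-0.3, 0.2]]
grad_W = ste_backward(W, grad_W_q, alpha=0.8)
print(grad_W)
```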

### 3.3 Activation Quantization (8-bit Per-Token Abs-Max)

Activations are quantized to INT8 per-token:

$$X_q = \text{round}\left(\frac{X}{\gamma} \times 127\right) \quad \text{where} \quad \gamma = \max_{j} |X_{j}|$$

Dequantization: $X \approx X_q \times \frac{\gamma}{127}$
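
A small sketch of the per-token abs-max round trip, assuming one activation vector per token. The helper names and the handling of an all-zero token are choices made for this example.

```python
def quantize_token(x):
    """Quantize one token's activations to integers in [-127, 127]."""
    gamma = max(abs(v) for v in x)
    if gamma == 0.0:
        return [0] * len(x), 0.0  # assumption: an all-zero token stays all-zero
    return [round(v / gamma * 127) for v in x], gamma


def dequantize_token(x_q, gamma):
    """Approximate reconstruction: X ~= X_q * gamma / 127."""
    return [q * gamma / 127 for q in x_q]


x = [0.5, -2.0, 1.2, 0.0]
x_q, gamma = quantize_token(x)
x_hat = dequantize_token(x_q, gamma)
print(x_q, gamma)
print(x_hat)
```

The reconstruction error per element is bounded by half a quantization step, i.e. `gamma / 254`.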

### 3.4 SubLN: Sub-Layer Normalization

Each sublayer (attention, FFN) uses this sequence:

1. **Pre-RMSNorm:** $x' = \text{RMSNorm}(x)$
2. **Quantize:** $x'_q = \text{activation\_quant}(x')$ (INT8, per-token abs-max)
3. **Compute:** $y = \text{BitLinear}(x'_q, W_q, \alpha)$ — ternary weights × INT8 activations
4. **Residual:** $z = x + y$

**Inference math:** $W_q \in \{-1, 0, +1\}$ and $X_q \in \{-127, \dots, 127\}$, so the core computation is:

$$Y_{ij} = \alpha_j \cdot \sum_{k} W_{q,jk} \cdot X_{q,ik}$$

This is **pure integer addition** — the ternary weight selects between add, skip, or subtract of each activation. No FP multipliers needed until the final $\alpha_j$ scale.
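
To make the add/skip/subtract idea concrete, here is a reference sketch in plain Python. It is not the NEON kernel: `bitlinear_row` and `bitlinear` are illustrative names, and the final `gamma / 127` factor is the activation dequantization from Section 3.3 applied alongside the per-row weight scale.

```python
def bitlinear_row(w_q_row, x_q):
    """Integer accumulation for one output channel: add, skip, or subtract."""
    acc = 0
    for w, x in zip(w_q_row, x_q):
        if w == 1:
            acc += x
        elif w == -1:
            acc -= x
        # w == 0 contributes nothing and is skipped
    return acc


def bitlinear(W_q, alphas, x_q, gamma):
    """Apply the floating-point scales only once per output element."""
    return [alpha * bitlinear_row(row, x_q) * gamma / 127
            for row, alpha in zip(W_q, alphas)]


# Hypothetical ternary weights, row scales, and one quantized token.
W_q = [[1, 0, -1], [0, 1, 1]]
alphas = [0.5, 2.0]
x_q, gamma = [10, -20, 30], 1.27
y = bitlinear(W_q, alphas, x_q, gamma)
print(y)
```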

### 3.5 Theoretical Speedup

| Operation | FP16 MAC | Ternary-INT8 Ops | Speedup |
|-----------|----------|-------------------|---------|
| Weight × Activation | FP16 multiply-add | INT8 add-with-sign | ~4× |
| Memory bandwidth | 2 bytes/param | 0.1975 bytes/param | ~10× |
| KV cache (4K seq, per layer, K or V tensor) | 4K × 512 × 2B = 4 MB | 4K × 512 × 2B = 4 MB | Same (GQA) |
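
The KV-cache row is a per-tensor figure; scaling it up to the full model is simple arithmetic. The sketch below assumes an FP16 cache and, as a simplification, that every one of the 24 layers caches the full sequence (sliding-window layers could in principle cache less). The function name is illustrative.

```python
def kv_cache_bytes(seq_len, num_kv_heads=4, head_dim=128,
                   num_layers=24, bytes_per_value=2):
    """Return cache sizes in bytes for one K (or V) tensor, one layer, and all layers."""
    per_tensor = seq_len * num_kv_heads * head_dim * bytes_per_value
    return {
        "per_layer_k_or_v": per_tensor,
        "per_layer": 2 * per_tensor,             # K and V
        "total": 2 * per_tensor * num_layers,    # all layers
    }


sizes = kv_cache_bytes(4096)
for name, size in sizes.items():
    print(f"{name:>18}: {size / 2**20:.0f} MiB")
```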

---

## 4. Hybrid Attention: Gemma-2 × FlashAttention-2

### 4.1 Interleaved Local-Global Pattern

    Layer 0:  Global Attention (full 8192 ctx)   ← FlashAttention-2
    Layer 1:  Sliding Window (1024 local)        ← FlashAttention-2 mask
    Layer 2:  Global Attention
    Layer 3:  Sliding Window
    ...       (repeat)
    Layer 22: Global Attention
    Layer 23: Sliding Window


**Why:** Global at even layers captures long-range tool dependencies (passing variables between steps 1→8). Sliding window at odd layers provides 1024-token burst resolution for dense JSON structures.
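
The interleaving rule can be expressed as a tiny masking predicate. This is a sketch under two assumptions that the text does not spell out: attention is causal, and a sliding-window layer lets a query see itself plus the previous 1023 positions.

```python
def layer_kind(layer: int) -> str:
    """Even layers are global, odd layers use the sliding window."""
    return "global" if layer % 2 == 0 else "sliding"


def can_attend(layer: int, query_pos: int, key_pos: int, window: int = 1024) -> bool:
    """True if the query position may attend to the key position in this layer."""
    if key_pos > query_pos:  # causal mask: never look ahead
        return False
    if layer_kind(layer) == "global":
        return True
    return query_pos - key_pos < window  # assumption: window includes the query itself


print([layer_kind(i) for i in range(4)])
print(can_attend(0, 5000, 0), can_attend(1, 5000, 0))
```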

### 4.2 Logit Softcapping (Gemma-2 Innovation)

After attention score computation and before softmax:

$$S'_{ij} = C_{\text{attn}} \cdot \tanh\left(\frac{S_{ij}}{C_{\text{attn}}}\right) \quad \text{where} \quad C_{\text{attn}} = 50$$

After the final FFN output projection:

$$Y' = C_{\text{final}} \cdot \tanh\left(\frac{Y}{C_{\text{final}}}\right) \quad \text{where} \quad C_{\text{final}} = 30$$

**Effect:** Prevents attention from becoming too diffuse or too peaked, critical for stable ternary training.
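
Both caps use the same tanh form, so one helper covers them. A minimal sketch with the two cap values quoted above; the function name is illustrative.

```python
import math


def softcap(x: float, cap: float) -> float:
    """Smoothly bound x to the open interval (-cap, cap)."""
    return cap * math.tanh(x / cap)


ATTN_CAP, FINAL_CAP = 50.0, 30.0

for raw in (1.0, 25.0, 100.0, 1000.0):
    print(raw, softcap(raw, ATTN_CAP), softcap(raw, FINAL_CAP))
```

Small values pass through almost unchanged, while large values saturate near the cap.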

### 4.3 RoPE with Extended Frequency Scale

Using SmolLM2's RoPE implementation with base $f = 10000$:

$$\Theta = \{\theta_i = 10000^{-2i/d}\}_{i=0}^{d/2-1}$$

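Extended to 8192 via NTK-aware style frequency scaling, which is equivalent to raising the RoPE base from $10000$ to $10000 \cdot s$ (rather than linearly interpolating positions):

$$\theta'_i = \theta_i \cdot s^{-\frac{2i}{d}} \quad \text{where} \quad s = \frac{8192}{2048} = 4$$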
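
The scaled frequencies can be tabulated directly from the two formulas above. This sketch assumes $d$ is the per-head dimension (128); the function name is illustrative.

```python
def rope_freqs(head_dim: int = 128, base: float = 10000.0, scale: float = 1.0):
    """theta_i = base^(-2i/d) * scale^(-2i/d) for i = 0 .. d/2 - 1."""
    return [
        base ** (-2 * i / head_dim) * scale ** (-2 * i / head_dim)
        for i in range(head_dim // 2)
    ]


original = rope_freqs()
extended = rope_freqs(scale=8192 / 2048)

# The highest frequency is untouched; the lowest frequencies shrink the most.
print(original[0], extended[0])
print(original[-1], extended[-1])
```

Because both factors share the exponent, the scaled result matches using a base of `10000 * 4` directly.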

---

## 5. ReLU² Feed-Forward Network

SmolLM2-style FFN with squared ReLU activation:

$$\text{FFN}(x) = W_{\text{down}} \cdot \left(\text{ReLU}(W_{\text{up}} \cdot x)^2 \odot W_{\text{gate}} \cdot x\right)$$

Where $\odot$ is element-wise multiplication (SwiGLU-style gating but with ReLU²).

**Rationale:** ReLU² produces sparser activations than GELU/SwiGLU at the same FLOPs, which synergizes with ternary weights — more zero activations mean more zero-selects in the INT8 adder tree.
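
A full-precision reference for the gated ReLU² block, written with plain lists so the data flow is visible. The helper names and toy weights are made up for the example, and quantization is left out on purpose.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu2(v: float) -> float:
    """Squared ReLU: max(v, 0)^2."""
    return max(v, 0.0) ** 2


def ffn(x, W_up, W_gate, W_down):
    """FFN(x) = W_down . (ReLU(W_up x)^2 * (W_gate x)), elementwise product."""
    up = matvec(W_up, x)
    gate = matvec(W_gate, x)
    hidden = [relu2(u) * g for u, g in zip(up, gate)]
    return matvec(W_down, hidden)


# Toy sizes: model dim 2, intermediate dim 3.
x = [1.0, -2.0]
W_up = [[1, 0], [0, 1], [1, 1]]
W_gate = [[2, 0], [0, 1], [1, 0]]
W_down = [[1, 1, 1], [0.5, 0, 0]]
y = ffn(x, W_up, W_gate, W_down)
print(y)
```

In this toy input two of the three hidden units are zeroed by the ReLU, which is the sparsity the rationale refers to.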

---

## 6. Knowledge Distillation Protocol

### 6.1 Teacher-Student Setup

| Role | Model | Parameters | Format |
|------|-------|------------|--------|
| **Teacher (Reasoning)** | Gemma-2-2B (Open) | 2.6B | FP16, full precision |
| **Teacher (Tool Structure)** | LFM2-1.2B-Tool | 1.2B | Q4_K_M GGUF |
| **Student** | SLM750-Edge | 749M | 1.58-bit ternary |

### 6.2 Distillation Loss

$$\mathcal{L} = \lambda_1 \cdot \mathcal{L}_{\text{CE}}(y_s, y_t) + \lambda_2 \cdot \mathcal{L}_{\text{KD}}(p_s, p_t) + \lambda_3 \cdot \mathcal{L}_{\text{struct}}(z_s, z_t)$$

Where:

- $\mathcal{L}_{\text{CE}}$ is standard cross-entropy on hard labels
- $\mathcal{L}_{\text{KD}} = T^2 \cdot \text{KL}(p_s/T \parallel p_t/T)$ is distilled soft logits (temperature $T=4$)
- $\mathcal{L}_{\text{struct}}$ is a structured output loss comparing JSON AST trees via tree-edit distance
- $\lambda_1 = 0.3, \lambda_2 = 0.5, \lambda_3 = 0.2$
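
The weighted sum can be sketched for a single token position as follows. Two interpretations are assumed here and are not confirmed by the text: $p/T$ is read as the softmax of the logits at temperature $T$, and the KL term keeps the student-first argument order written above. The structural loss is taken as a precomputed number, since the tree-edit-distance computation is outside the scope of this example.

```python
import math


def softmax(logits, T: float = 1.0):
    """Temperature-softened softmax, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_loss(student_logits, teacher_logits, label, struct_loss,
                 T: float = 4.0, lambdas=(0.3, 0.5, 0.2)):
    l1, l2, l3 = lambdas
    ce = -math.log(softmax(student_logits)[label])            # hard-label term
    kd = T * T * kl(softmax(student_logits, T),
                    softmax(teacher_logits, T))               # soft-label term
    return l1 * ce + l2 * kd + l3 * struct_loss


# Hypothetical logits over a 3-token vocabulary.
loss = distill_loss([2.0, 0.0, -1.0], [3.0, -1.0, -1.0], label=0, struct_loss=0.5)
print(loss)
```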

### 6.3 Namespace Correction (Online Patch)

During distillation data generation, the LFM2 teacher produces actions with incorrect namespaces (e.g., `data_science_engineering.create_post` instead of `HR.Recruiting.CreateJobPosting`). We apply a **dynamic namespace projection**:

1. Parse JSON output from LFM2 teacher
2. Extract action tuples (namespace, function)
3. Project through a similarity-weighted lookup table built from the ground-truth dataset:
   $$(n_s, f_s) \rightarrow \arg\max_{(n_t, f_t) \in \mathcal{G}} \text{sim}(n_s, n_t) \cdot \text{sim}(f_s, f_t)$$
4. Rewrite the JSON with corrected namespaces **in RAM** before feeding to student

This table is constructed offline from the 2000 training samples before distillation begins.
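
The projection step amounts to an arg-max over the ground-truth tuples. In the sketch below, `difflib.SequenceMatcher` is a stand-in for the unspecified `sim` function, and the ground-truth entries other than `HR.Recruiting` / `CreateJobPosting` are invented for illustration; `namespace_fix.py` may well use a different similarity measure.

```python
from difflib import SequenceMatcher


def sim(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the similarity actually used is not specified."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def project(action, ground_truth):
    """Map a (namespace, function) pair to the best-matching ground-truth pair."""
    n_s, f_s = action
    return max(ground_truth, key=lambda t: sim(n_s, t[0]) * sim(f_s, t[1]))


# Hypothetical lookup table built from the training samples.
ground_truth = [
    ("HR.Recruiting", "CreateJobPosting"),
    ("Finance.Billing", "CreateInvoice"),
]

fixed = project(("hr_recruiting", "create_job_posting"), ground_truth)
print(fixed)
```

A purely lexical similarity only corrects near-miss spellings; mapping semantically related but differently worded namespaces would need an embedding-based `sim`.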

---

## 7. GBNF Grammar for Guaranteed Tool-Call JSON

The inference grammar is compiled into a DAG that the sampler follows token-by-token:

```gbnf
# Excerpt: the dep-array, vc-array, ws, number, string, object, and array rules are not shown.
root   ::= "{\"actions\": " actions-array ", \"dependencies\": " dep-array ", \"variable_chain\": " vc-array "}"
actions-array ::= "[" action (", " action)* "]"
action ::= "{" ws "\"step\":" ws number ws "," ws
            "\"namespace\":" ws string ws "," ws
            "\"function\":" ws string ws "," ws
            "\"params\":" ws object ws "," ws
            "\"depends_on\":" ws array ws "," ws
            "\"output_refs\":" ws (object | array) ws "," ws
            "\"rollback_ref\":" ws (string | "null") ws "," ws
            "\"condition\":" ws string ws "}"
```

At inference time, this grammar is compiled by llama.cpp's GBNF engine into a pushdown automaton. The sampler never proposes a token outside the grammar, so the first sampled sequence is valid JSON, provided generation runs to completion rather than being cut off by the token limit.

**Key insight:** Because the model was trained with 1.58-bit quantization and on grammar-structured data, the ternary weights learn to assign high probability only to grammar-valid token sequences. The grammar acts as a safety net: the model rarely needs it because the distribution is already near-deterministic for tool calls.
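
On the consuming side, a cheap structural check can still be worthwhile, for instance to catch output truncated by a token limit. The sketch below validates only the keys that appear in the grammar excerpt; the function name and the sample payload are hypothetical.

```python
import json

TOP_LEVEL_KEYS = ("actions", "dependencies", "variable_chain")
ACTION_KEYS = ("step", "namespace", "function", "params",
               "depends_on", "output_refs", "rollback_ref", "condition")


def validate_tool_call(text: str) -> dict:
    """Parse a tool-call string and check the keys named in the grammar."""
    doc = json.loads(text)  # raises json.JSONDecodeError (a ValueError) if malformed
    for key in TOP_LEVEL_KEYS:
        if key not in doc:
            raise ValueError(f"missing top-level key: {key}")
    for index, action in enumerate(doc["actions"]):
        missing = [k for k in ACTION_KEYS if k not in action]
        if missing:
            raise ValueError(f"action {index} is missing {missing}")
    return doc


sample = json.dumps({
    "actions": [{
        "step": 1, "namespace": "HR.Recruiting", "function": "CreateJobPosting",
        "params": {}, "depends_on": [], "output_refs": {},
        "rollback_ref": None, "condition": "always",
    }],
    "dependencies": [],
    "variable_chain": [],
})
print(validate_tool_call(sample)["actions"][0]["function"])
```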


## 8. Training Protocol

### 8.1 Phase 1: Warm-up (500 steps)

- Full FP16 precision (no quantization)
- Learning rate: 1e-4 → 1e-3 linear warmup
- Standard AdamW (β₁=0.9, β₂=0.95)
- Batch size: 64 sequences × 2048 tokens

### 8.2 Phase 2: Quantization-Aware Distillation (2000 steps)

- Enable BitLinear (ternary weights + INT8 activations)
- Enable distillation loss with Gemma-2 teacher
- Learning rate: 1e-3 → 1e-4 cosine decay
- Batch size: 32 sequences × 2048 tokens (reduced due to teacher memory)
- Gradient clipping: max_norm=1.0

### 8.3 Phase 3: Studio-Fine (1000 steps)

- Freeze all BitLinear scales ($\alpha$)
- Unfreeze only the LM head and embedding layer
- Fine-tune with GBNF-constrained outputs as training targets
- Learning rate: 5e-5 constant
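
The three learning-rate regimes can be combined into one step-indexed schedule. This sketch assumes the phases run back to back on a single global step counter (500 + 2000 + 1000 steps); the function and constant names are illustrative and not taken from `distill_train.py`.

```python
import math

WARMUP_STEPS, DISTILL_STEPS, FINE_STEPS = 500, 2000, 1000
LR_LOW, LR_PEAK, LR_FINE = 1e-4, 1e-3, 5e-5


def learning_rate(step: int) -> float:
    """Learning rate at a global step, following the three phases in order."""
    if step < WARMUP_STEPS:
        # Phase 1: linear warmup from 1e-4 to 1e-3
        return LR_LOW + (LR_PEAK - LR_LOW) * step / WARMUP_STEPS
    if step < WARMUP_STEPS + DISTILL_STEPS:
        # Phase 2: cosine decay from 1e-3 back down to 1e-4
        t = (step - WARMUP_STEPS) / DISTILL_STEPS
        return LR_LOW + 0.5 * (LR_PEAK - LR_LOW) * (1 + math.cos(math.pi * t))
    # Phase 3: constant 5e-5
    return LR_FINE


for step in (0, 250, 500, 1500, 2499, 2500, 3499):
    print(step, f"{learning_rate(step):.2e}")
```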

### 8.4 Inference Deployment

1. Export: W_q (ternary) + α (FP16 per-channel) + embeddings → GGUF
2. Integrate the GBNF grammar with llama.cpp's grammar engine
3. Compile with:
   - ARM NEON dot-product kernel for ternary×INT8 matmul
   - FlashAttention-2 for sliding window + global attention

Throughput:

- Prefill: >120 tok/s on Snapdragon 8 Gen 3 (4x Cortex-A720)
- Decode: >60 tok/s (single token at a time, bandwidth-bound)

## 10. Code Map

| File | Purpose |
|------|---------|
| `bitnet_linear.py` | BitLinear layer with STE ternary quantization |
| `slm750_model.py` | Full 750M hybrid architecture with SubLN |
| `distill_train.py` | Knowledge distillation loop with Gemma-2 teacher |
| `namespace_fix.py` | Dynamic namespace correction table builder |
| `tool_grammar.gbnf` | GBNF grammar for tool-call JSON |
| `gather_training_data.py` | Generate distilled dataset from LFM2 teacher |

Key Architectural Features

| Component | Specification |
|-----------|---------------|
| Weight Precision | Ternary {-1, 0, +1} (training), Q4_K_M (storage) |
| FFN Activation | ReLU² (relu(x)²) |
| Attention | Grouped-Query Attention (GQA), 12 heads, 4 KV heads |
| Positional Encoding | RoPE (Rotary Position Embeddings) |
| Normalization | RMSNorm (epsilon = 1e-6) |
| Logit Softcapping | Attention: 50.0, Final: 30.0 (tanh-based) |
| Context Length | 8,192 tokens |

Quantization Format

The model is quantized using Q4_K_M (4-bit K-quant, medium size):

- File type: LLAMA_FTYPE_MOSTLY_Q4_K_M (15)
- BPW: 5.27 bits per weight (including overhead)
- Compression ratio: ~6:1 vs. full precision
- Method: llama.cpp llama-quantize with --allow-requantize

Compatibility

Supported Runtimes

| Runtime | Status | Notes |
|---------|--------|-------|
| llama.cpp (mainline) | ✅ Full | Requires LLM_ARCH_BITNET support (commit 52b3df002+) |
| llama-cpp-python | ✅ Full | v0.3.x+ with BitNet support |
| text-generation-webui | ✅ Full | Via llama.cpp backend |
| LangChain | ✅ Full | Via LlamaCpp wrapper |
| Ollama | ⚠️ Manual | Requires custom Modelfile; not officially supported |
| llama_cpp.server | ✅ Full | OpenAI-compatible API server |

Known Limitations

GPU offloading is not supported for BitNet architectures in the current llama.cpp release — all inference runs on CPU. Flash Attention is not compatible with the BitNet attention implementation. Batch inference (parallel decoding) is limited by the CPU-only constraint.

Thanks to
