# SLM750-Edge 1.58-bit: Q4_K_M Quantized GGUF
A compact, efficiently quantized BitNet b1.58 ternary model optimized for edge deployment. This repository provides a plug-and-play GGUF file ready for use with llama.cpp and its ecosystem (llama-cli, llama-cpp-python, text-generation-webui, and more).
⚠️ **IMPORTANT, PLEASE READ:** This model does NOT run with the standard llama.cpp. It requires a patched bitnet.cpp (see below). Without the patch, you will encounter `missing tensor 'blk.0.attn_sub_norm.weight'` or similar errors. The BitNet folder contains benchmark scripts, automatic installation scripts, and fixes for known problems that occur when installing PyTorch and sentencepiece under `python3.13/site-packages`.
SLM750-Edge: Distillation + GGUF Build Colab
slm750_edge_gpu_package.tgz.gz Google Drive
## Model Highlights
| Attribute | Value |
|---|---|
| Architecture | BitNet b1.58 (ternary {-1, 0, +1} weights) |
| Parameters | ~1.4B |
| Embedding Dim | 1,536 |
| Attention Heads | 12 (4 KV heads, GQA) |
| Feed-Forward | 4,096 (ReLU² activation) |
| Context Length | 8,192 tokens |
| Vocabulary | 256,000 tokens (Gemma-2 tokenizer) |
| Quantization | Q4_K_M (5.27 BPW) |
| File Size | 873 MB |
| License | BitNet 1.58 |
## File Description

| File | Description |
|---|---|
| `quantized_q4km.gguf` | Q4_K_M quantized GGUF |
| ✅ Recommended | Q4_K_M quantized GGUF `quant_fixed_from_source_gguf` (921 MB) |
## Quick Start

### Prerequisites

ik_llama: a fork of llama.cpp with better CPU and hybrid GPU/CPU performance, new SOTA quantization types, first-class BitNet support, better DeepSeek performance via MLA, FlashMLA, fused MoE operations and tensor overrides for hybrid GPU/CPU inference, row-interleaved quant packing, etc.

Building llama-cpp with BitNet Support: https://copilot.microsoft.com/shares/pages/kaesS7TdPccnaLsu4iQ9T

You need one of the following:

- llama.cpp built from source with BitNet support (commit `52b3df002` or later), OR
- llama-cpp-python v0.3.x+
### Usage with llama-cli

```bash
# Basic text generation
./llama-cli -m quantized_q4km.gguf \
  -p "Explain quantum computing in simple terms" \
  -n 256 \
  -t 4 \
  --temp 0.7 \
  --top-p 0.9

# Chat mode
./llama-cli -m quantized_q4km.gguf \
  -p "You are a helpful assistant." \
  --chat-template gemma \
  -n 512 \
  -t 4
```
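### Usage with llama-cpp-python (Python)

```python
from llama_cpp import Llama

# Load the local GGUF file with the full 8K context on 4 CPU threads.
llm = Llama(
    model_path="quantized_q4km.gguf",
    n_ctx=8192,
    n_threads=4,
    verbose=False,
)

# Plain text completion; echo=False returns only the generated continuation.
output = llm(
    "What is the meaning of life?",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
    echo=False,
)
print(output["choices"][0]["text"])
```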
### Usage with text-generation-webui

- Place `quantized_q4km.gguf` in the `models/` directory.
- Launch text-generation-webui with `--model quantized_q4km.gguf`.
- Select the model in the UI under the "Model" tab.
### Usage with LangChain

```python
from langchain_community.llms import LlamaCpp

# Wrap the local GGUF file as a LangChain LLM.
llm = LlamaCpp(
    model_path="quantized_q4km.gguf",
    n_ctx=8192,
    n_threads=4,
    temperature=0.7,
    top_p=0.9,
    verbose=False,
)

response = llm.invoke("Write a short poem about AI.")
print(response)
```
## Performance
# SLM750-Edge: A 1.58-Bit Hybrid Edge Transformer
## Architectural Mathematics & Training Protocol
### "Zero think time for tool calls. >120 tok/s on mobile hardware."
---
## 1. Design Philosophy
SLM750-Edge combines three revolutionary advances:
1. **Hybrid Gemma-2/SmolLM2 architecture** — interleaved local+global attention with Gemma-2's logit softcapping, optimized to 750M
2. **BitNet b1.58 native quantization** — all weights constrained to {−1, 0, +1} during training, eliminating FP multiplications at inference
3. **GBNF grammar ejection** — tool-calling structure is *baked into the token distribution* so no CoT or post-processing is needed
**Result:** A 750M model that fits in about 148 MB at 1.58 bits per weight (750M × 1.58 bits ÷ 8 ≈ 148 MB, or roughly 141 MiB), runs entirely in CPU cache on modern phone SoCs, and produces guaranteed-JSON tool calls at wire speed.
---
## 2. Architecture Blueprint
### 2.1 Macro Architecture
```
SLM750-Edge(
  token_embed: Vocab(256k) → dim=1536     # embedding table
  layers:      24 × HybridDecoderLayer    # 12 global + 12 local (interleaved)
  norm_final:  SubLN                      # pre-RMSNorm + activation quantization
  lm_head:     dim=1536 → Vocab(256k)     # tied weights (optional)
)
```
### 2.2 Layer Configuration
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| `hidden_dim` | 1536 | Multiple of a power of two (3 × 512) for NEON/ARM dot-product efficiency |
| `intermediate_dim` | 4096 | ReLU² activation: 1536 × 8/3 = 4096 (SmolLM2 ratio) |
| `num_attention_heads` | 12 | 1536 ÷ 12 = 128 head_dim, optimal for FlashAttention-2 |
| `num_kv_heads` | 4 | GQA ratio 3:1, about 67% KV-cache reduction vs MHA (4 of 12 heads cached) |
| `head_dim` | 128 | FlashAttention-2 tile alignment (128 × 128 warp) |
| `num_layers` | 24 | 12 global + 12 sliding window (interleaved) |
| `sliding_window_size` | 1024 | Covers ~1K tokens of local context (Gemma-2 itself uses a 4,096-token window) |
| `vocab_size` | 256000 | SmolLM2 tokenizer + tool-token extensions |
| `max_seq_len` | 8192 | 8K context for complex multi-step tool orchestration |
**Total parameters:** ~750M (exact: 748,634,112)
### 2.3 Parameter Budget (Exact)
| Component | Shape | Parameters |
|-----------|-------|------------|
| Embedding | 256000 × 1536 | 393,216,000 |
| Per-layer Attention Q | 1536 × 1536 | 2,359,296 |
| Per-layer Attention K | 1536 × 512 (GQA 4 KV heads) | 786,432 |
| Per-layer Attention V | 1536 × 512 | 786,432 |
| Per-layer Attention O | 1536 × 1536 | 2,359,296 |
| Per-layer FFN Gate | 1536 × 4096 | 6,291,456 |
| Per-layer FFN Up | 1536 × 4096 | 6,291,456 |
| Per-layer FFN Down | 4096 × 1536 | 6,291,456 |
| Per-layer SubLN (2x) | 2 × 1536 | 3,072 |
| **Per-layer total** | sum of the rows above | **25,168,896** |
| **24 layers total** | 24 × 25,168,896 | **604,053,504** |
| Norm final | 1536 | 1,536 |
| LM Head (tied w/ embed) | — | 0 |
| **Grand total** | — | **748,634,112** ≈ 749M |
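
The budget can be recomputed directly from the shapes listed in the table. The sketch below uses only those shapes; the variable names are illustrative. Summing the rows as listed gives 25,168,896 parameters per layer and about 997M in total with a tied LM head (about 1.39B with an untied head, close to the ~1.4B quoted under Model Highlights), so the 748,634,112 figure is worth re-checking against the released checkpoint.

```python
# Shapes are copied from the parameter-budget table; names are illustrative.
VOCAB, DIM, FFN, KV_DIM, LAYERS = 256_000, 1_536, 4_096, 512, 24

per_layer = {
    "attn_q": DIM * DIM,
    "attn_k": DIM * KV_DIM,   # 4 KV heads x 128 head_dim = 512
    "attn_v": DIM * KV_DIM,
    "attn_o": DIM * DIM,
    "ffn_gate": DIM * FFN,
    "ffn_up": DIM * FFN,
    "ffn_down": FFN * DIM,
    "subln": 2 * DIM,
}
per_layer_total = sum(per_layer.values())
embedding = VOCAB * DIM
final_norm = DIM

# Tied head: the LM head reuses the embedding matrix and adds nothing.
total_tied = embedding + LAYERS * per_layer_total + final_norm
# Untied head: a second VOCAB x DIM matrix is stored.
total_untied = total_tied + embedding

for name, count in per_layer.items():
    print(f"{name:>9}: {count:>12,}")
print(f"per layer  : {per_layer_total:,}")
print(f"tied head  : {total_tied:,}")
print(f"untied head: {total_untied:,}")
```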
---
## 3. The BitNet b1.58 Mathematical Core
### 3.1 Ternary Weight Quantization
Every linear weight tensor $W \in \mathbb{R}^{d \times k}$ is quantized during **both forward and backward passes**:

$$W_q = \text{clip}\left(\text{round}\left(\frac{W}{\alpha}\right), -1, +1\right)$$

The **scale factor** $\alpha$ is per-output-channel (per-row) and stored in FP16.
**Total storage:** $W_q$ needs $\log_2 3 \approx 1.58$ bits/param in theory (2 bits/param with simple ternary packing), and $\alpha$ adds one FP16 value per row, which is negligible overhead.
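
A minimal sketch of per-row ternary quantization in plain Python. The text does not say how $\alpha$ is computed, so the abs-mean scale used here (the choice made in the BitNet b1.58 paper, applied per row) is an assumption, as are the function name and the `eps` guard.

```python
def ternary_quantize(weights, eps=1e-8):
    """Quantize each row to {-1, 0, +1} with one scale per row.

    Assumption: alpha is the mean absolute value of the row (abs-mean).
    """
    w_q, alphas = [], []
    for row in weights:
        alpha = sum(abs(w) for w in row) / len(row) + eps
        # Round to the nearest integer, then clip into the ternary range.
        w_q.append([max(-1, min(1, round(w / alpha))) for w in row])
        alphas.append(alpha)
    return w_q, alphas


W = [[0.9, -0.05, -1.2, 0.3],
     [0.0, 0.0, 2.0, -2.0]]
W_q, alphas = ternary_quantize(W)
print(W_q)
print(alphas)
```

Dequantizing is then `alpha * w_q` per row; only the ternary codes and one scale per row need to be stored.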
### 3.2 Straight-Through Estimator (STE) for Gradient Flow
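The round-clip step is non-differentiable. We use STE:

$$\frac{\partial \mathcal{L}}{\partial W} \approx \frac{\partial \mathcal{L}}{\partial W_q} \cdot \mathbf{1}_{|W| \leq \alpha}$$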
This passes gradients through the quantization boundary while clamping outliers. The indicator function $\mathbf{1}_{|W| \leq \alpha}$ prevents weights from growing unboundedly.
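
The masking described above can be sketched as a manual backward step. This is an illustration in plain Python rather than the training code; `ste_backward` and its arguments are hypothetical names.

```python
def ste_backward(grad_wq, weights, alphas):
    """Straight-through estimator: copy the gradient w.r.t. W_q onto W,
    zeroing entries whose latent weight lies outside [-alpha, +alpha]."""
    return [
        [g if abs(w) <= a else 0.0 for g, w in zip(g_row, w_row)]
        for g_row, w_row, a in zip(grad_wq, weights, alphas)
    ]


grad_w = ste_backward(
    grad_wq=[[1.0, 2.0, 3.0]],
    weights=[[0.2, -0.9, 1.5]],   # 1.5 exceeds alpha, so its gradient is dropped
    alphas=[1.0],
)
print(grad_w)
```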
### 3.3 Activation Quantization (8-bit Per-Token Abs-Max)
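Activations are quantized to INT8 per-token:

$$\gamma = \max_i |X_i|, \qquad X_q = \text{clip}\left(\text{round}\left(X \cdot \frac{127}{\gamma}\right), -127, 127\right)$$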
Dequantization: $X \approx X_q \times \frac{\gamma}{127}$
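
A small sketch of per-token abs-max quantization and the dequantization rule above, in plain Python. The function names and the `eps` guard against all-zero tokens are assumptions for illustration.

```python
def quantize_activations(tokens, eps=1e-8):
    """Quantize each token vector to INT8 using its own abs-max scale gamma."""
    x_q, gammas = [], []
    for x in tokens:
        gamma = max(abs(v) for v in x) + eps
        x_q.append([max(-127, min(127, round(v * 127 / gamma))) for v in x])
        gammas.append(gamma)
    return x_q, gammas


def dequantize_activations(x_q, gammas):
    """Apply X ~= X_q * gamma / 127 per token."""
    return [[q * g / 127 for q in row] for row, g in zip(x_q, gammas)]


X = [[0.5, -1.0, 0.25]]
X_q, gammas = quantize_activations(X)
X_hat = dequantize_activations(X_q, gammas)
print(X_q, X_hat)
```

The reconstruction error per element is bounded by about half a quantization step, that is $\gamma / 254$.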
### 3.4 SubLN: Sub-Layer Normalization
Each sublayer (attention, FFN) uses this sequence:
1. **Pre-RMSNorm:** $x' = \text{RMSNorm}(x)$
2. **Quantize:** $x'_q = \text{activation\_quant}(x')$ (INT8, per-token abs-max)
3. **Compute:** $y = \text{BitLinear}(x'_q, W_q, \alpha)$ — ternary weights × INT8 activations
4. **Residual:** $z = x + y$
**Inference math:** $W_q \in \{-1, 0, +1\}$ and $X_q \in \{-127, \dots, 127\}$, so the core computation is:

$$y_j = \alpha_j \cdot \frac{\gamma}{127} \sum_i W_{q,ji} \, X_{q,i}$$
This is **pure integer addition** — the ternary weight selects between add, skip, or subtract of each activation. No FP multipliers needed until the final $\alpha_j$ scale.
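
The add / skip / subtract behaviour can be made concrete with a toy matrix-vector product. This is a readability sketch, not an optimized kernel; the function names are hypothetical.

```python
def ternary_matvec(w_q, x_q):
    """Integer accumulation only: +1 adds, -1 subtracts, 0 skips."""
    out = []
    for row in w_q:
        acc = 0
        for w, x in zip(row, x_q):
            if w == 1:
                acc += x
            elif w == -1:
                acc -= x
        out.append(acc)
    return out


def bitlinear_output(w_q, alphas, x_q, gamma):
    """Apply the floating-point scales once per output channel, at the end."""
    return [acc * alpha * gamma / 127
            for acc, alpha in zip(ternary_matvec(w_q, x_q), alphas)]


w_q = [[1, 0, -1], [-1, -1, 1]]
x_q = [10, 20, 30]
acc = ternary_matvec(w_q, x_q)
y = bitlinear_output(w_q, alphas=[0.5, 2.0], x_q=x_q, gamma=1.27)
print(acc, y)
```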
### 3.5 Theoretical Speedup
| Operation | FP16 MAC | Ternary-INT8 Ops | Speedup |
|-----------|----------|-------------------|---------|
| Weight × Activation | FP16 multiply-add | INT8 add-with-sign | ~4× |
| Memory bandwidth | 2 bytes/param | 0.1975 bytes/param | ~10× |
| KV cache (4K seq, per layer, K or V) | 4K × 512 × 2B = 4 MB | 4K × 512 × 2B = 4 MB | Same (GQA) |
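
The table's arithmetic can be reproduced in a few lines. The parameter count is the one stated in this document; the full-cache figure additionally assumes that all 24 layers cache both K and V for the whole sequence, which overstates the sliding-window layers if they keep only their window.

```python
PARAMS = 748_634_112  # parameter count as stated in this document


def weight_bytes(params, bits_per_param):
    return params * bits_per_param / 8


def kv_cache_bytes(seq_len, kv_dim=512, bytes_per_value=2, layers=1, tensors=1):
    """kv_dim = 4 KV heads x 128 head_dim; tensors=2 counts both K and V."""
    return seq_len * kv_dim * bytes_per_value * layers * tensors


fp16_mb = weight_bytes(PARAMS, 16) / 1e6
ternary_mb = weight_bytes(PARAMS, 1.58) / 1e6
one_tensor = kv_cache_bytes(4096)                       # the table's 4 MB entry
full_cache = kv_cache_bytes(4096, layers=24, tensors=2)  # assumption: no window cap

print(f"FP16 weights   : {fp16_mb:.1f} MB")
print(f"Ternary weights: {ternary_mb:.1f} MB ({fp16_mb / ternary_mb:.1f}x smaller)")
print(f"KV, one tensor : {one_tensor / 2**20:.0f} MiB")
print(f"KV, full model : {full_cache / 2**20:.0f} MiB")
```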
---
## 4. Hybrid Attention: Gemma-2 × FlashAttention-2
### 4.1 Interleaved Local-Global Pattern
```
Layer 0:  Global Attention (full 8192 ctx)   ← FlashAttention-2
Layer 1:  Sliding Window (1024 local)        ← FlashAttention-2 mask
Layer 2:  Global Attention
Layer 3:  Sliding Window
... (repeat)
Layer 22: Global Attention
Layer 23: Sliding Window
```
**Why:** Global at even layers captures long-range tool dependencies (passing variables between steps 1→8). Sliding window at odd layers provides 1024-token burst resolution for dense JSON structures.
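
A minimal sketch of the interleaving rule and the resulting causal masks. It assumes the sliding window counts the query token itself, so a query sees at most the last 1024 positions; the exact boundary convention is not given in the text.

```python
WINDOW = 1024
NUM_LAYERS = 24


def layer_kind(layer_index):
    """Even layers use global attention, odd layers a sliding window."""
    return "global" if layer_index % 2 == 0 else "sliding"


def can_attend(layer_index, query_pos, key_pos, window=WINDOW):
    """Causal mask: never look ahead; sliding layers also limit look-back."""
    if key_pos > query_pos:
        return False
    if layer_kind(layer_index) == "global":
        return True
    return query_pos - key_pos < window


kinds = [layer_kind(i) for i in range(NUM_LAYERS)]
print(kinds.count("global"), kinds.count("sliding"))
print(can_attend(0, 2000, 0), can_attend(1, 2000, 0))
```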
### 4.2 Logit Softcapping (Gemma-2 Innovation)
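After attention score computation and before softmax:

$$s \leftarrow 50 \cdot \tanh\left(\frac{s}{50}\right)$$

After the final FFN output projection:

$$z \leftarrow 30 \cdot \tanh\left(\frac{z}{30}\right)$$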
**Effect:** Prevents attention from becoming too diffuse or too peaked, critical for stable ternary training.
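
Tanh-based softcapping is a one-line function. The sketch below uses the cap values 50.0 (attention) and 30.0 (final logits) listed elsewhere in this card; the function name is illustrative.

```python
import math


def softcap(value, cap):
    """Squash a logit smoothly into (-cap, +cap); near zero it is almost linear."""
    return cap * math.tanh(value / cap)


attention_scores = [-400.0, -5.0, 0.0, 5.0, 400.0]
capped = [softcap(s, 50.0) for s in attention_scores]
print(capped)
print(softcap(100.0, 30.0))
```

Small logits pass through nearly unchanged, while extreme values saturate at the cap.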
### 4.3 RoPE with Extended Frequency Scale
Using SmolLM2's RoPE implementation with base $f = 10000$:

$$\theta_i = f^{-2i/d}, \quad i = 0, \dots, d/2 - 1, \quad d = 128$$

The context is extended to 8192 tokens via linear frequency scaling (NTK-aware).
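
A rough sketch of the RoPE frequency table with an optional linear scale. The scale factor actually used is not stated in the text, so `scale=2.0` below is purely hypothetical; an NTK-aware variant would instead enlarge the base rather than divide the frequencies.

```python
def rope_inv_freq(head_dim=128, base=10000.0, scale=1.0):
    """Inverse frequencies theta_i = base^(-2i/d), optionally divided by `scale`.

    Dividing frequencies by `scale` is equivalent to dividing positions by it
    (linear position interpolation).
    """
    return [1.0 / (scale * base ** (2 * i / head_dim))
            for i in range(head_dim // 2)]


def rope_angle(position, inv_freq):
    """Rotation angle applied to each (even, odd) channel pair at `position`."""
    return [position * f for f in inv_freq]


inv_freq = rope_inv_freq()
scaled = rope_inv_freq(scale=2.0)  # hypothetical scale factor
print(len(inv_freq), inv_freq[0], inv_freq[-1])
print(rope_angle(8191, scaled)[0], rope_angle(4095.5, inv_freq)[0])
```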
---
## 5. ReLU² Feed-Forward Network
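SmolLM2-style FFN with squared ReLU activation:

$$\text{FFN}(x) = W_{\text{down}} \left( \text{ReLU}^2(W_{\text{gate}} \, x) \odot (W_{\text{up}} \, x) \right)$$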
Where $\odot$ is element-wise multiplication (SwiGLU-style gating but with ReLU²).
**Rationale:** ReLU² produces sparser activations than GELU/SwiGLU at the same FLOPs, which synergizes with ternary weights — more zero activations mean more zero-selects in the INT8 adder tree.
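
The gated ReLU² block can be sketched with tiny matrices in plain Python. The gate/up/down naming follows the parameter-budget table; the helper functions are illustrative and ignore quantization.

```python
def relu2(value):
    """Squared ReLU: zero for negative inputs, x^2 otherwise."""
    positive = max(value, 0.0)
    return positive * positive


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu2_gated_ffn(x, w_gate, w_up, w_down):
    gate = [relu2(g) for g in matvec(w_gate, x)]   # sparse, non-negative gate
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]     # element-wise product
    return matvec(w_down, hidden)


x = [1.0, -2.0]
w_gate = [[1.0, 0.0], [0.0, 1.0]]
w_up = [[2.0, 0.0], [0.0, 3.0]]
w_down = [[1.0, 1.0]]
print(relu2_gated_ffn(x, w_gate, w_up, w_down))
```

The second hidden unit is gated to exactly zero because its pre-activation is negative, which is the sparsity effect described above.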
---
## 6. Knowledge Distillation Protocol
### 6.1 Teacher-Student Setup
| Role | Model | Parameters | Format |
|------|-------|------------|--------|
| **Teacher (Reasoning)** | Gemma-2-2B (Open) | 2.6B | FP16, full precision |
| **Teacher (Tool Structure)** | LFM2-1.2B-Tool | 1.2B | Q4_K_M GGUF |
| **Student** | SLM750-Edge | 749M | 1.58-bit ternary |
### 6.2 Distillation Loss
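$$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{CE}} + \lambda_2 \mathcal{L}_{\text{KD}} + \lambda_3 \mathcal{L}_{\text{struct}}$$

Where: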
- $\mathcal{L}_{\text{CE}}$ is standard cross-entropy on hard labels
- $\mathcal{L}_{\text{KD}} = T^2 \cdot \text{KL}(p_s/T \parallel p_t/T)$ is distilled soft logits (temperature $T=4$)
- $\mathcal{L}_{\text{struct}}$ is a structured output loss comparing JSON AST trees via tree-edit distance
- $\lambda_1 = 0.3, \lambda_2 = 0.5, \lambda_3 = 0.2$
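
A minimal sketch of how the three terms could be combined for a single token position, in plain Python. The KL argument order follows the notation above (student first); many implementations use the teacher distribution first. The structural loss is passed in as a number because the tree-edit-distance computation is not described here, and all function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                      # subtract the max for stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_index):
    return -math.log(softmax(student_logits)[target_index])


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    p_s = softmax(student_logits, temperature)
    p_t = softmax(teacher_logits, temperature)
    kl = sum(ps * math.log(ps / pt) for ps, pt in zip(p_s, p_t))
    return temperature ** 2 * kl


def total_loss(student_logits, teacher_logits, target_index, struct_loss,
               lambdas=(0.3, 0.5, 0.2)):
    l_ce, l_kd, l_struct = lambdas
    return (l_ce * cross_entropy(student_logits, target_index)
            + l_kd * kd_loss(student_logits, teacher_logits)
            + l_struct * struct_loss)


print(total_loss([2.0, 0.5, -1.0], [1.5, 1.0, -0.5], target_index=0, struct_loss=0.1))
```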
### 6.3 Namespace Correction (Online Patch)
During distillation data generation, the LFM2 teacher produces actions with incorrect namespaces (e.g., `data_science_engineering.create_post` instead of `HR.Recruiting.CreateJobPosting`). We apply a **dynamic namespace projection**:
1. Parse JSON output from LFM2 teacher
2. Extract action tuples (namespace, function)
3. Project through a similarity-weighted lookup table built from the ground-truth dataset:
$$(n_s, f_s) \rightarrow \arg\max_{(n_t, f_t) \in \mathcal{G}} \text{sim}(n_s, n_t) \cdot \text{sim}(f_s, f_t)$$
4. Rewrite the JSON with corrected namespaces **in RAM** before feeding to student
This table is constructed offline from the 2000 training samples before distillation begins.
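
The projection step can be sketched as follows. The similarity function is not specified in the text, so `difflib.SequenceMatcher` is used here as a stand-in, and the ground-truth pairs are made-up examples.

```python
from difflib import SequenceMatcher


def sim(a, b):
    """Stand-in string similarity in [0, 1]; the real metric is an assumption."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def project(namespace, function, ground_truth):
    """Return the ground-truth (namespace, function) pair maximizing
    sim(namespace) * sim(function)."""
    return max(ground_truth,
               key=lambda pair: sim(namespace, pair[0]) * sim(function, pair[1]))


# Hypothetical lookup table built from ground-truth samples.
GROUND_TRUTH = [
    ("HR.Recruiting", "CreateJobPosting"),
    ("Finance.Billing", "CreateInvoice"),
]

corrected = project("hr_recruiting", "create_job_post", GROUND_TRUTH)
print(corrected)
```

A purely lexical metric like this one only repairs near-miss spellings; mapping semantically different names would need an embedding-based similarity instead.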
---
## 7. GBNF Grammar for Guaranteed Tool-Call JSON
The inference grammar is compiled into a DAG that the sampler follows token-by-token:
```gbnf
root ::= "{\"actions\": " actions-array ", \"dependencies\": " dep-array ", \"variable_chain\": " vc-array "}"
actions-array ::= "[" action (", " action)* "]"
action ::= "{" ws "\"step\":" ws number ws "," ws
           "\"namespace\":" ws string ws "," ws
           "\"function\":" ws string ws "," ws
           "\"params\":" ws object ws "," ws
           "\"depends_on\":" ws array ws "," ws
           "\"output_refs\":" ws (object | array) ws "," ws
           "\"rollback_ref\":" ws (string | "null") ws "," ws
           "\"condition\":" ws string ws "}"
```
At inference time, this grammar is compiled by llama.cpp's GBNF engine into a deterministic pushdown automaton. The sampler never proposes a token outside the grammar, producing 100% valid JSON on the first sampled sequence.

**Key insight:** Because the model was trained with 1.58-bit quantization and on grammar-structured data, the ternary weights learn to assign high probability only to grammar-valid token sequences. The grammar acts as a safety net: the model rarely needs it because the distribution is already near-deterministic for tool calls.
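
For testing outputs outside the sampler, the shape enforced by the grammar can be approximated with a small checker. This is a rough sketch: it verifies only the key names and their order as they appear in the grammar excerpt, not whitespace or value types, and the function name is hypothetical.

```python
import json

TOP_KEYS = ["actions", "dependencies", "variable_chain"]
ACTION_KEYS = ["step", "namespace", "function", "params",
               "depends_on", "output_refs", "rollback_ref", "condition"]


def matches_tool_call_shape(text):
    """True if `text` is JSON with the key order the grammar prescribes."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(doc, dict) or list(doc) != TOP_KEYS:
        return False
    actions = doc["actions"]
    if not isinstance(actions, list) or not actions:  # grammar needs >= 1 action
        return False
    return all(isinstance(a, dict) and list(a) == ACTION_KEYS for a in actions)


example = json.dumps({
    "actions": [{
        "step": 1, "namespace": "HR.Recruiting", "function": "CreateJobPosting",
        "params": {}, "depends_on": [], "output_refs": {},
        "rollback_ref": None, "condition": "always",
    }],
    "dependencies": [],
    "variable_chain": [],
})
print(matches_tool_call_shape(example))
```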
## 8. Training Protocol

### 8.1 Phase 1: Warm-up (500 steps)
- Full FP16 precision (no quantization)
- Learning rate: 1e-4 → 1e-3 linear warmup
- Standard AdamW (β₁=0.9, β₂=0.95)
- Batch size: 64 sequences × 2048 tokens
### 8.2 Phase 2: Quantization-Aware Distillation (2000 steps)
- Enable BitLinear (ternary weights + INT8 activations)
- Enable distillation loss with Gemma-2 teacher
- Learning rate: 1e-3 → 1e-4 cosine decay
- Batch size: 32 sequences × 2048 tokens (reduced due to teacher memory)
- Gradient clipping: max_norm=1.0
### 8.3 Phase 3: Studio-Fine (1000 steps)
- Freeze all BitLinear scales ($\alpha$)
- Unfreeze only the LM head and embedding layer
- Fine-tune with GBNF-constrained outputs as training targets
- Learning rate: 5e-5 constant
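
The three learning-rate phases can be written as a single schedule function. This sketch assumes the phases run back to back over global steps 0 to 3499 and that the cosine decay reaches 1e-4 exactly at the end of Phase 2; neither detail is stated explicitly.

```python
import math

WARMUP_STEPS, DISTILL_STEPS, FINE_STEPS = 500, 2000, 1000


def learning_rate(step):
    """Learning rate for a global step across the three phases."""
    if step < WARMUP_STEPS:
        # Phase 1: linear warmup 1e-4 -> 1e-3
        return 1e-4 + (1e-3 - 1e-4) * step / WARMUP_STEPS
    if step < WARMUP_STEPS + DISTILL_STEPS:
        # Phase 2: cosine decay 1e-3 -> 1e-4
        progress = (step - WARMUP_STEPS) / DISTILL_STEPS
        return 1e-4 + 0.5 * (1e-3 - 1e-4) * (1 + math.cos(math.pi * progress))
    # Phase 3: constant
    return 5e-5


for step in (0, 250, 500, 1500, 2499, 2500, 3499):
    print(step, f"{learning_rate(step):.2e}")
```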
### 8.4 Inference Deployment
1. Export: W_q (ternary) + α (FP16 per-channel) + embeddings → GGUF
2. Integrate GBNF grammar with llama.cpp grammar engine
3. Compile with:
- ARM NEON dot-product kernel for ternary×INT8 matmul
- FlashAttention-2 for sliding window + global attention
- Prefill: >120 tok/s on Snapdragon 8 Gen 3 (4x Cortex-A720)
- Decode: >60 tok/s (single token at a time, bandwidth-bound)
## 10. Code Map

| File | Purpose |
|---|---|
| `bitnet_linear.py` | BitLinear layer with STE ternary quantization |
| `slm750_model.py` | Full 750M hybrid architecture with SubLN |
| `distill_train.py` | Knowledge distillation loop with Gemma-2 teacher |
| `namespace_fix.py` | Dynamic namespace correction table builder |
| `tool_grammar.gbnf` | GBNF grammar for tool-call JSON |
| `gather_training_data.py` | Generate distilled dataset from LFM2 teacher |
## Key Architectural Features

| Component | Specification |
|---|---|
| Weight Precision | Ternary {-1, 0, +1} (training), Q4_K_M (storage) |
| FFN Activation | ReLU² (relu(x)²) |
| Attention | Grouped-Query Attention (GQA), 12 heads, 4 KV heads |
| Positional Encoding | RoPE (Rotary Position Embeddings) |
| Normalization | RMSNorm (epsilon = 1e-6) |
| Logit Softcapping | Attention: 50.0, Final: 30.0 (tanh-based) |
| Context Length | 8,192 tokens |
## Quantization Format

The model is quantized using Q4_K_M (4-bit K-quant, medium size):

- File type: `LLAMA_FTYPE_MOSTLY_Q4_K_M` (15)
- BPW: 5.27 bits per weight (including overhead)
- Compression ratio: ~6:1 vs. full precision
- Method: llama.cpp `llama-quantize` with `--allow-requantize`
## Compatibility

### Supported Runtimes

| Runtime | Status | Notes |
|---|---|---|
| llama.cpp (mainline) | ✅ Full | Requires `LLM_ARCH_BITNET` support (commit 52b3df002+) |
| llama-cpp-python | ✅ Full | v0.3.x+ with BitNet support |
| text-generation-webui | ✅ Full | Via llama.cpp backend |
| LangChain | ✅ Full | Via LlamaCpp wrapper |
| Ollama | ⚠️ Manual | Requires custom Modelfile; not officially supported |
| llama-cpp.server | ✅ Full | OpenAI-compatible API server |
### Known Limitations

- GPU offloading is not supported for BitNet architectures in the current llama.cpp release; all inference runs on CPU.
- Flash Attention is not compatible with the BitNet attention implementation.
- Batch inference (parallel decoding) is limited by the CPU-only constraint.
## Thanks to
- llama.cpp
- BitNet b1.58
- Gemma-2
- SmolLM2
- Heyneo
- Google Ai Assistant
- Microsoft Copilot