# Zamba2-7B-Instruct-HXQ

2.0x smaller than BF16. 81-layer hybrid Mamba2 + Transformer. Largest HXQ hybrid model.
Zamba2-7B-Instruct compressed from 14.7 GB (BF16) to 7.5 GB. 213 linear layers compressed, 573 exact tensors preserved. No calibration data. Just `pip install` and `from_pretrained()`.
## Install and Run
```bash
pip install "helix-substrate[hf]"
```
```python
import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/zamba2-7b-instruct-hxq")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/zamba2-7b-instruct-hxq")

inputs = tokenizer("Explain the theory of relativity in simple terms:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
That's it. `import helix_substrate` registers the quantizer; `from_pretrained()` handles the rest automatically.
## Benchmark
| | Dense (BF16) | HXQ |
|---|---|---|
| Size | 14.7 GB | 7.5 GB |
| Perplexity (WikiText-2) | pending | pending |
| Compression ratio | – | 2.0x |
| Compressed modules | – | 213 HelixLinear layers |
| Architecture | Zamba2 (81 layers, Mamba2 + shared Transformer) | unchanged |
## Good to Know
- **GPU recommended** – the 7.5 GB model needs 10+ GB of VRAM. Use `device_map="auto"` for multi-GPU.
- **Not fine-tunable** – compressed weights are read-only (`is_trainable = False`).
- **Requires `helix-substrate`** – the quantizer is not built into transformers. You need `pip install "helix-substrate[hf]"`.
- **Requires `transformers >= 4.45`** – for Zamba2 architecture support.
- **`mamba-ssm` recommended** – without it, inference falls back to a slower sequential code path.
- **PPL pending** – perplexity evaluation requires a cloud GPU (the model doesn't fit on a 4 GB T2000).
## What is HelixCode?
HelixCode is a universal weight compression codec based on vector quantization:

- Each weight matrix is replaced by a 256-entry codebook (float32) + a uint8 index matrix + optional sidecar corrections for outlier values
- The compressed form is the executable – `HelixLinear` performs `codebook[indices] @ x` directly, with no decompression step
- Works on any `nn.Linear` regardless of architecture (Transformer, Mamba, MLP, CNN)
- No calibration data required – unlike GPTQ/AWQ, codebooks are fit from the weights alone
## How It Works
1. `import helix_substrate` registers the `hxq` quantizer with HuggingFace
2. `from_pretrained()` reads `quantization_config.quant_method = "hxq"` from `config.json`
3. The quantizer replaces 213 `nn.Linear` modules with `HelixLinear` shells before weight loading
4. Safetensors populates the codebook, indices, and sidecar buffers directly
5. The model runs in compressed form – no decompression needed
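Step 3 can be pictured with a small PyTorch sketch. `replace_linears` and `make_shell` are hypothetical names for illustration; the real quantizer performs this swap inside `from_pretrained()` rather than by hand, and its shells hold codebook/index buffers instead of dense weights.

```python
import torch.nn as nn

def replace_linears(model: nn.Module, make_shell) -> None:
    """Recursively swap every nn.Linear for a shell module built by make_shell."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, make_shell(child))
        else:
            replace_linears(child, make_shell)

# Toy demonstration on a nested module tree (nn.Identity stands in
# for a HelixLinear-style shell).
toy = nn.Sequential(nn.Linear(8, 8), nn.Sequential(nn.Linear(8, 4)))
replace_linears(toy, lambda lin: nn.Identity())
print(toy)
```

Doing the swap before weight loading means safetensors can write the compressed buffers straight into the shells, so a dense copy of the model never exists in memory.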
## Architecture Details
Zamba2-7B-Instruct is a hybrid architecture with:

- 81 total layers (Mamba2 + shared Transformer hybrid)
- `hidden_size=3584`, `attention_hidden_size=7168`, 32 attention heads
- `mamba_d_state=64`, `mamba_d_conv=4`
- `vocab_size=32000`
213 linear layers are compressed (162 Mamba projections, 38 attention/MLP, 26 LoRA adapters). Normalization layers, embeddings, conv1d, and Mamba-specific parameters (`A_log`, `D`, `dt_bias`) are stored at full precision.
## Compression Receipt
```
Compressed modules: 213
Exact tensors:      573 (norms, embeddings, conv1d, A_log, D, dt_bias, LoRA)
Skip tensors:       243 (from original model)
Total keys:         1425
Dense size:         14.7 GB (BF16)
Compressed size:    7.5 GB
Compression ratio:  2.0x
PPL delta:          pending (cloud GPU eval)
Gate 1:             PASS (structural validation + SHA256)
```
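The headline ratio follows from simple byte accounting, sketched below. This back-of-envelope version ignores sidecar corrections; the 573 exact tensors kept at full precision are why the overall ratio lands at 1.96x rather than a full 2x.

```python
# Per-weight storage: BF16 uses 2 bytes; a uint8 codebook index uses 1.
bf16_bytes = 2
index_bytes = 1
codebook_bytes = 256 * 4  # one 256-entry float32 codebook per layer

# Example layer: a hidden_size x hidden_size projection (3584 x 3584).
weights_per_layer = 3584 * 3584
dense = weights_per_layer * bf16_bytes
compressed = weights_per_layer * index_bytes + codebook_bytes

print(round(dense / compressed, 2))  # ~2.0 per compressed layer
print(round(14.7 / 7.5, 2))          # 1.96 overall, reported as 2.0x
```

The codebook overhead (1 KB per layer) is negligible next to megabytes of indices, so each compressed layer shrinks by essentially the full 2x.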
## Companion Models

Same codec, same `pip install`, multiple architectures:
| Model | Architecture | Ratio | PPL Delta |
|---|---|---|---|
| qwen2.5-14b-instruct-helix | Transformer | 3.4x | pending |
| qwen2.5-7b-instruct-helix | Transformer | 2.2x | +6.34% |
| qwen2.5-3b-instruct-helix | Transformer | 1.6x | +0.69% |
| qwen2.5-coder-3b-helix | Transformer (code) | 1.6x | +1.92% |
| qwen2.5-coder-1.5b-instruct-helix | Transformer (code) | 2.4x | +1.63% |
| tinyllama-1.1b-helix | Transformer | 4.0x | +0.78% |
| zamba2-2.7b-instruct-helix | Hybrid (Mamba2+Transformer) | 1.8x | +6.59% |
| zamba2-1.2b-helix | Hybrid (Mamba2+Transformer) | 1.7x | +2.90% |
| mamba2-1.3b-helix | Pure SSM (Mamba2) | 2.1x | +8.0% |
| mamba-130m-helix | Pure SSM | 3.8x | +18.4% |
## Citation

```bibtex
@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}
```
## License
Apache 2.0 (inherited from Zyphra/Zamba2-7B-instruct).