Zamba2-7B-Instruct-HXQ

2.0x smaller than BF16. 81-layer hybrid Mamba2 + Transformer. The largest HXQ hybrid model.

Zamba2-7B-Instruct compressed from 14.7 GB (BF16) to 7.5 GB. 213 linear layers compressed, 573 exact tensors preserved. No calibration data. Just pip install and from_pretrained().

Install and Run

pip install "helix-substrate[hf]"

import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/zamba2-7b-instruct-hxq")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/zamba2-7b-instruct-hxq")

inputs = tokenizer("Explain the theory of relativity in simple terms:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

That's it. import helix_substrate registers the quantizer. from_pretrained() handles the rest automatically.

Benchmark

                         Dense (BF16)                                     HXQ
Size                     14.7 GB                                          7.5 GB
Perplexity (WikiText-2)  pending                                          pending
Compression ratio        n/a                                              2.0x
Compressed modules       n/a                                              213 HelixLinear layers
Architecture             Zamba2 (81 layers, Mamba2 + shared Transformer)  unchanged

Good to Know

  • GPU recommended: the 7.5 GB model requires 10+ GB of VRAM. Use device_map="auto" for multi-GPU.
  • Not fine-tunable: compressed weights are read-only (is_trainable = False).
  • Requires helix-substrate: the quantizer is not built into transformers. You need pip install "helix-substrate[hf]".
  • Requires transformers >= 4.45: needed for Zamba2 architecture support.
  • mamba-ssm recommended: without it, inference falls back to a slower sequential code path.
  • PPL pending: perplexity requires a cloud GPU eval (the model doesn't fit on a 4 GB T2000).

What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

  • Each weight matrix is replaced by a 256-entry codebook (float32), a uint8 index matrix, and optional sidecar corrections for outlier values
  • The compressed form is the executable: HelixLinear computes codebook[indices] @ x directly, with no decompression step
  • Works on any nn.Linear regardless of architecture (Transformer, Mamba, MLP, CNN)
  • No calibration data required: unlike GPTQ/AWQ, codebooks are fit from the weights alone
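The scheme in the bullets above can be sketched numerically. Below is a minimal NumPy toy of the codebook-plus-uint8-index idea: the matrix shape, the quantile-based codebook fit, and the omission of sidecar corrections are all simplifications for illustration, not the real HelixCode codec.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense weight matrix we want to compress (toy size).
W = rng.standard_normal((64, 32)).astype(np.float32)

# Fit a 256-entry codebook from the weights alone (no calibration data).
# Here: a simple quantile grid; the real codec is more elaborate.
codebook = np.quantile(W, np.linspace(0.0, 1.0, 256)).astype(np.float32)

# uint8 index matrix: each weight maps to its nearest codebook entry.
indices = np.abs(W[..., None] - codebook).argmin(axis=-1).astype(np.uint8)

# The compressed form is the executable: look up, then matmul directly.
x = rng.standard_normal(32).astype(np.float32)
y_compressed = codebook[indices] @ x
y_dense = W @ x
```

Even this crude scalar version stays close to the dense output; the sidecar corrections in the real codec exist to patch the outlier weights where a shared codebook is least accurate.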

How It Works

  1. import helix_substrate registers the hxq quantizer with HuggingFace
  2. from_pretrained() reads quantization_config.quant_method = "hxq" from config.json
  3. The quantizer replaces 213 nn.Linear modules with HelixLinear shells before weight loading
  4. Safetensors populates the codebook, indices, and sidecar buffers directly
  5. The model runs in compressed form; no decompression is needed
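Step 2 above hinges on a quantization_config entry in the model's config.json. A minimal sketch of that entry, keeping only the one field the flow above confirms (quant_method = "hxq"); any additional fields the real config carries are not shown here:

```json
{
  "quantization_config": {
    "quant_method": "hxq"
  }
}
```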

Architecture Details

Zamba2-7B-Instruct is a hybrid architecture with:

  • 81 total layers (Mamba2 + shared Transformer hybrid)
  • hidden_size=3584, attention_hidden_size=7168, 32 attention heads
  • mamba_d_state=64, mamba_d_conv=4
  • vocab_size=32000

213 linear layers compressed (162 Mamba projections, 38 attention/MLP, 26 LoRA adapters). Normalization layers, embeddings, conv1d, and Mamba-specific parameters (A_log, D, dt_bias) are stored at full precision.

Compression Receipt

Compressed modules:  213
Exact tensors:       573  (norms, embeddings, conv1d, A_log, D, dt_bias, LoRA)
Skip tensors:        243  (from original model)
Total keys:          1425
Dense size:          14.7 GB (BF16)
Compressed size:     7.5 GB
Compression ratio:   2.0x
PPL delta:           pending (cloud GPU eval)
Gate 1:              PASS (structural validation + SHA256)

Companion Models

Same codec, same pip install, multiple architectures:

Model                              Architecture                 Ratio  PPL Delta
qwen2.5-14b-instruct-helix         Transformer                  3.4x   pending
qwen2.5-7b-instruct-helix          Transformer                  2.2x   +6.34%
qwen2.5-3b-instruct-helix          Transformer                  1.6x   +0.69%
qwen2.5-coder-3b-helix             Transformer (code)           1.6x   +1.92%
qwen2.5-coder-1.5b-instruct-helix  Transformer (code)           2.4x   +1.63%
tinyllama-1.1b-helix               Transformer                  4.0x   +0.78%
zamba2-2.7b-instruct-helix         Hybrid (Mamba2+Transformer)  1.8x   +6.59%
zamba2-1.2b-helix                  Hybrid (Mamba2+Transformer)  1.7x   +2.90%
mamba2-1.3b-helix                  Pure SSM (Mamba2)            2.1x   +8.0%
mamba-130m-helix                   Pure SSM                     3.8x   +18.4%

Citation

@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}

License

Apache 2.0 (inherited from Zyphra/Zamba2-7B-instruct).
