Qwen2.5-Coder-32B-Instruct-BitNet-1.58b

Architecture: 32 Billion Parameters | BitNet 1.58-bit Ternary Quantization


IMPORTANT: Parameter Count Display

HuggingFace displays "9B params" because it counts packed bytes, not actual parameters. This model has the full 32B parameter Qwen2.5-Coder architecture. The weights are stored as ternary values ({-1, 0, +1}) packed 4 per byte, which reduces storage to 9.6 GB but preserves all 32 billion parameters.
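The byte count is easy to sanity-check. A rough sketch (which tensors stay unquantized and the scale dtype are assumptions, not stated facts about this model):

```python
# Why HuggingFace shows "9B params": it counts stored bytes, not weights.
params = 32 * 10**9          # actual parameter count of the 32B architecture
packed_bytes = params // 4   # 4 ternary values packed per byte
print(packed_bytes / 1e9)    # 8.0 -> ~8 GB of packed weights alone;
                             # per-group scales and any unquantized tensors
                             # account for the rest of the ~9.6 GB on disk
```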


Overview

This is an experimental BitNet 1.58-bit quantization of the Qwen2.5-Coder-32B-Instruct model using absmean scaling with group-wise quantization. The model stores weights as ternary values ({-1, 0, +1}) packed 4 values per byte.

This is research/experimental work. Quality and performance have not been formally benchmarked.

Specifications

Property           Value
Base Model         Qwen/Qwen2.5-Coder-32B-Instruct
Architecture       Qwen2 (Qwen2ForCausalLM)
Parameters         32B (full architecture preserved)
Quantization       BitNet 1.58-bit ternary
Bits per Weight    ~1.58
Group Size         64
Original Size      65.53 GB (BF16)
Quantized Size     9.6 GB (SafeTensors)
GGUF Size          11 GB (TQ2_0)
Compression        ~6.8x (65.53 GB BF16 → 9.6 GB SafeTensors)

Formats

Format        File                            Description
SafeTensors   model-*.safetensors             Sharded quantized weights + scales
GGUF          qwen2.5-coder-32b-TQ2_0.gguf    llama.cpp TQ2_0 format (experimental)

GGUF Compatibility Note: The GGUF conversion is experimental. Our BitNet quantization uses group size 64, while TQ2_0 uses 256-element blocks. This may cause compatibility issues with some inference engines. The SafeTensors format is the primary supported format.

Quantization Method

Algorithm

  1. Reshape weights into groups of 64
  2. Compute per-group scale: scale = mean(|weights|)
  3. Normalize and round to nearest ternary: q = round(w / scale) clamped to {-1, 0, +1}
  4. Map to unsigned: {-1, 0, +1} → {0, 1, 2}
  5. Pack 4 values per byte: v0 + v1*3 + v2*9 + v3*27
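
The five steps above can be sketched in NumPy (an illustrative reference, not the actual Rust/Candle tool; the even divisibility by the group size is an assumption):

```python
import numpy as np

def bitnet_quantize(weights, group_size=64):
    """Absmean ternary quantization following the five steps above."""
    # 1. Reshape weights into groups of 64
    groups = weights.reshape(-1, group_size)
    # 2. Per-group scale = mean(|weights|) (epsilon guards all-zero groups)
    scales = np.abs(groups).mean(axis=1, keepdims=True) + 1e-8
    # 3. Normalize, round to nearest ternary, clamp to {-1, 0, +1}
    q = np.clip(np.round(groups / scales), -1, 1).astype(np.int8)
    # 4. Map {-1, 0, +1} -> {0, 1, 2}
    u = (q + 1).astype(np.uint8).reshape(-1, 4)
    # 5. Pack 4 base-3 digits per byte: v0 + v1*3 + v2*9 + v3*27
    packed = u[:, 0] + u[:, 1] * 3 + u[:, 2] * 9 + u[:, 3] * 27
    return packed, scales.squeeze(1)
```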

Tooling

  • Quantization: Custom Rust tool using Candle
  • GGUF Conversion: llama.cpp convert_hf_to_gguf.py

Hardware Used

  • GPU: NVIDIA RTX 5080 (16GB VRAM)
  • Quantization time: ~369 seconds (streaming mode)
  • Memory: Streaming mode with CPU fallback for large tensors (>3GB threshold)

Usage

With Ollama/llama.cpp (experimental)

# llama.cpp (GGUF format - experimental, may have issues)
./llama-cli -m qwen2.5-coder-32b-TQ2_0.gguf -p "Write a Python function:"

Unpacking Weights (Python)

def unpack_ternary(packed_byte):
    """Unpack 4 ternary values from byte."""
    values = []
    val = packed_byte
    for _ in range(4):
        values.append((val % 3) - 1)  # {0,1,2} → {-1,0,+1}
        val //= 3
    return values
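
A quick round-trip check against the packing formula from the Quantization Method section (`pack_ternary` here is an illustrative inverse, not part of the model tooling):

```python
def pack_ternary(values):
    """Inverse of unpack_ternary: map {-1,0,+1} -> {0,1,2} and
    combine four values as base-3 digits (v0 + v1*3 + v2*9 + v3*27)."""
    packed = 0
    for v in reversed(values):       # build from the most significant digit down
        packed = packed * 3 + (v + 1)
    return packed

assert pack_ternary([-1, 0, 1, 0]) == 48   # 0 + 1*3 + 2*9 + 1*27
assert pack_ternary([1, 1, 1, 1]) == 80    # 2 + 2*3 + 2*9 + 2*27
```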

Limitations

  • Quality not benchmarked - May have significant degradation vs original
  • Requires custom runtime - The standard transformers library does not support ternary-packed weights
  • Experimental - Not intended for production use without evaluation
  • GGUF keeps embeddings/lm_head at F16, hence larger than SafeTensors
  • HuggingFace may show incorrect param count due to packed storage

License

Apache 2.0 (inherited from Qwen2.5-Coder-32B-Instruct)

Citation

@misc{qwen-coder-32b-bitnet-2025,
  title={Qwen2.5-Coder-32B-BitNet-1.58b: Experimental BitNet Quantization},
  author={Tzervas},
  year={2025},
  url={https://huggingface.co/tzervas/qwen2.5-coder-32b-bitnet-1.58b}
}