GitHub Repo | Technical Report

👋 Join us on Discord and WeChat

Introduction

BitCPM4-CANN-1B-unquantized is the unquantized quantization-aware training (QAT) checkpoint of the BitCPM4-CANN-1B model. It stores the raw QAT parameters before fake-quantizer fusion; the ternary fake quantizers are defined in modeling.py and applied during forward propagation.

⚠️ This model is NOT intended for direct inference. It is designed as the starting point for fine-tuning BitCPM4-CANN. If you need a model for inference, please use the pseudo-quantized version: openbmb/BitCPM4-CANN-1B.

Key Characteristics

  • 🎯 Purpose: Fine-tuning only. The model weights are un-fused QAT parameters with fake quantizers embedded in the modeling.py forward logic.
  • πŸ”¬ Ternary Fake Quantizer: The forward pass in modeling.py contains ternary quantization logic (mapping weights to {-1, 0, 1} with group-wise scaling), which ensures the model continues learning under ternary constraints during fine-tuning.
  • πŸ”„ Post-Training Conversion: After fine-tuning, the model can be converted to pseudo-quantized format using the provided qat-convert.py script.
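
To make "mapping weights to {-1, 0, 1} with group-wise scaling" concrete, here is a minimal pure-Python sketch. It is not the code in modeling.py. The function name `ternary_fake_quantize` is hypothetical, and the absmean scale rule is an assumption borrowed from common 1.58-bit schemes; the real quantizer may compute or learn its scale differently.

```python
from typing import List


def ternary_fake_quantize(
    weights: List[float], group_size: int = -1, eps: float = 1e-8
) -> List[float]:
    """Map each group of weights to scale * {-1, 0, 1} (illustrative only).

    Assumption: the scale of a group is its mean absolute value (absmean).
    group_size <= 0 treats the whole list as a single group.
    """
    if not weights:
        return []
    if group_size <= 0:
        group_size = len(weights)

    out: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = sum(abs(w) for w in group) / len(group) + eps
        for w in group:
            code = max(-1, min(1, round(w / scale)))  # ternary code
            out.append(code * scale)                  # dequantized value
    return out


print(ternary_fake_quantize([0.9, -0.05, -1.1, 0.4]))
```

The output keeps the input's shape and floating-point type but holds at most three distinct values per group, which is why this is called fake (simulated) quantization.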

BitCPM4-CANN Model Family

Usage

Fine-tuning

This model is designed for fine-tuning with frameworks that support custom modeling code. The critical requirement is that the forward pass must go through the modeling.py file bundled with this model, which contains the ternary fake quantizer logic. This ensures the model parameters remain compatible with ternary quantization constraints throughout fine-tuning.

Supported Fine-tuning Frameworks

  • DeepSpeed (recommended): See MiniCPM Fine-tuning Guide
  • LLaMA Factory: Supports custom model loading with trust_remote_code=True
  • Other Frameworks: Any framework that supports HuggingFace-compatible model loading with custom modeling code

Important: Ensure Fake Quantizer is Active

When fine-tuning, you must ensure:

  1. Load the model with trust_remote_code=True so that the custom modeling.py (containing the ternary quantizer) is used.
  2. The forward pass during training goes through the ternary quantizer defined in modeling.py; do NOT replace or bypass the model's forward logic.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = 'openbmb/BitCPM4-CANN-1B-unquantized'
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)

# Proceed with your fine-tuning pipeline (DeepSpeed, LLaMA Factory, etc.)
# The ternary fake quantizer in modeling.py will be applied automatically during forward pass.
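
One way to check that the custom code is actually in use is to inspect where the loaded model's class comes from. The sketch below assumes that transformers places classes loaded with trust_remote_code=True in a module whose name starts with `transformers_modules`. The helper name `uses_remote_code` is hypothetical, and the check does not prove that the quantizer itself runs.

```python
def uses_remote_code(obj: object) -> bool:
    """Return True if obj's class lives in a dynamically loaded repo module.

    Assumption: remote-code classes are registered under the
    "transformers_modules" package by transformers.
    """
    return type(obj).__module__.split(".")[0] == "transformers_modules"
```

After loading, `uses_remote_code(model)` returning False would suggest that a built-in architecture was picked up instead of the bundled modeling code, in which case the ternary fake quantizer is probably not active.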

Post-Fine-tuning Conversion

After fine-tuning is complete, use the qat-convert.py script to fuse the fake quantizer and produce the pseudo-quantized model weights that can be used for inference:

python qat-convert.py \
    --input_bin <path-to-finetuned-pytorch.bin> \
    --output <path-to-output-pseudo-quantized-pytorch.bin> \
    --quant_type ternary \
    --group_size -1

The converted model can then be loaded for inference in the same way as openbmb/BitCPM4-CANN-1B, with no special quantization libraries required.
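
As a sanity check after conversion, you can test whether a converted weight matrix really holds only ternary levels. This sketch rests on two assumptions that the model card does not confirm: that `--group_size -1` means one scale per tensor, and that the converted file stores dequantized values (scale times {-1, 0, 1}). The helper name `looks_ternary` is hypothetical.

```python
import math
from typing import Iterable


def looks_ternary(values: Iterable[float], rel_tol: float = 1e-3) -> bool:
    """True if every non-zero value has (nearly) the same magnitude.

    That is the signature of scale * {-1, 0, 1} with a single scale.
    """
    magnitudes = [abs(v) for v in values if v != 0.0]
    if not magnitudes:
        return True  # all zeros is trivially ternary
    ref = max(magnitudes)
    return all(math.isclose(m, ref, rel_tol=rel_tol) for m in magnitudes)


print(looks_ternary([0.5, 0.0, -0.5, 0.5]))  # ternary pattern
print(looks_ternary([0.5, 0.25, -0.5]))      # two different magnitudes
```

With PyTorch, a linear layer's weight can be passed as `weight.float().flatten().tolist()`. Layers that are not quantized (for example norms, and possibly embeddings) would be expected to fail this check.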

Workflow Summary

┌─────────────────────────────────┐
│   BitCPM4-CANN-1B-unquantized   │   ← This model (QAT parameters + fake quantizer in modeling.py)
└───────────────┬─────────────────┘
                │
                ▼  Fine-tune (DeepSpeed / LLaMA Factory / ...)
┌─────────────────────────────────┐
│   Fine-tuned pytorch.bin        │   ← Still contains un-fused QAT parameters
└───────────────┬─────────────────┘
                │
                ▼  python qat-convert.py --quant_type ternary --group_size -1
┌─────────────────────────────────┐
│  Pseudo-quantized pytorch.bin   │   ← Ready for inference (same format as BitCPM4-CANN-1B)
└─────────────────────────────────┘

Technical Background

BitCPM4-CANN uses a ternary quantizer that maps each weight group to {-1, 0, 1} scaled by a group-wise factor, trained with Straight-Through Estimator (STE) for gradient flow. The unquantized checkpoint preserves the full-precision latent weights alongside the quantizer parameters, allowing the model to continue learning under quantization constraints during fine-tuning.
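
The role of STE can be written out as follows. The notation is introduced here for illustration and is not taken from the Technical Report: $W_i$ is a latent full-precision weight in a group, $s > 0$ is that group's scale (however it is obtained), $\hat{W}_i$ is the fake-quantized weight used in the forward pass, and $\mathcal{L}$ is the training loss.

```latex
\begin{aligned}
\hat{W}_i &= s \cdot \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{W_i}{s}\right),\, -1,\, 1\right)
  && \text{(forward: ternary code times scale)} \\
\frac{\partial \mathcal{L}}{\partial W_i} &\approx \frac{\partial \mathcal{L}}{\partial \hat{W}_i}
  && \text{(backward: quantizer treated as the identity)}
\end{aligned}
```

Because rounding has zero gradient almost everywhere, the second line is what allows the latent weights to keep receiving updates while the forward pass only ever sees ternary values.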

For full technical details, please refer to our Technical Report.

Statement

  • As a language model, BitCPM4-CANN generates content by learning from a vast amount of text.
  • However, it does not possess the ability to comprehend or express personal opinions or value judgments.
  • Any content generated by BitCPM4-CANN does not represent the viewpoints or positions of the model developers.
  • Therefore, when using content generated by BitCPM4-CANN, users should take full responsibility for evaluating and verifying it on their own.

LICENSE

  • This repository and BitCPM4-CANN models are released under the Apache-2.0 License.

Citation

  • Please cite our technical report if you find our work valuable.
@article{bitcpm4cann,
  title={{BitCPM-CANN}: Native 1.58-Bit Large Language Model Training on Ascend NPU},
  author={BitCPM Team},
  year={2026}
}