GitHub Repo | Technical Report
Join us on Discord and WeChat
Introduction
BitCPM4-CANN-0.5B-unquantized is the unquantized quantization-aware training (QAT) checkpoint of the BitCPM4-CANN-0.5B model. It stores the raw QAT parameters before fake-quantizer fusion; the ternary fake quantizers are defined in modeling.py and applied during forward propagation.
⚠️ This model is NOT intended for direct inference. It is designed as the starting point for fine-tuning BitCPM4-CANN. If you need a model for inference, please use the pseudo-quantized version: openbmb/BitCPM4-CANN-0.5B.
Key Characteristics
- Purpose: Fine-tuning only. The model weights are un-fused QAT parameters with fake quantizers embedded in the modeling.py forward logic.
- Ternary Fake Quantizer: The forward pass in modeling.py contains ternary quantization logic (mapping weights to {-1, 0, 1} with group-wise scaling), which ensures the model continues learning under ternary constraints during fine-tuning.
- Post-Training Conversion: After fine-tuning, the model can be converted to pseudo-quantized format using the provided qat-convert.py script.
BitCPM4-CANN Model Family
| Model | HuggingFace (Inference) | HuggingFace (Fine-tuning) |
|---|---|---|
| BitCPM4-CANN-0.5B | openbmb/BitCPM4-CANN-0.5B | openbmb/BitCPM4-CANN-0.5B-unquantized |
| BitCPM4-CANN-1B | openbmb/BitCPM4-CANN-1B | openbmb/BitCPM4-CANN-1B-unquantized |
| BitCPM4-CANN-3B | openbmb/BitCPM4-CANN-3B | openbmb/BitCPM4-CANN-3B-unquantized |
| BitCPM4-CANN-8B | openbmb/BitCPM4-CANN-8B | openbmb/BitCPM4-CANN-8B-unquantized |
Usage
Fine-tuning
This model is designed for fine-tuning with frameworks that support custom modeling code. The critical requirement is that the forward pass must go through the modeling.py file bundled with this model, which contains the ternary fake quantizer logic. This ensures the model parameters remain compatible with ternary quantization constraints throughout fine-tuning.
Supported Fine-tuning Frameworks
- DeepSpeed (recommended): See MiniCPM Fine-tuning Guide
- LLaMA Factory: Supports custom model loading with trust_remote_code=True
- Other Frameworks: Any framework that supports HuggingFace-compatible model loading with custom modeling code
Important: Ensure Fake Quantizer is Active
When fine-tuning, ensure the following:
- The model is loaded with trust_remote_code=True so that the custom modeling.py (containing the ternary quantizer) is used.
- The forward pass during training goes through the ternary quantizer defined in modeling.py. Do NOT replace or bypass the model's forward logic.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = 'openbmb/BitCPM4-CANN-0.5B-unquantized'
# trust_remote_code=True loads the bundled modeling.py with the ternary fake quantizer
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
# Proceed with your fine-tuning pipeline (DeepSpeed, LLaMA Factory, etc.)
# The ternary fake quantizer in modeling.py will be applied automatically during forward pass.
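Before starting a long training run, it can be worth confirming that the loaded model really comes from the bundled modeling.py and not from a built-in architecture. The helper below is a minimal sketch of such a check. It assumes that Transformers places classes loaded through trust_remote_code=True under a dynamically created `transformers_modules` package, which has been the behavior in recent releases but is an implementation detail, not a documented guarantee. The name `uses_remote_code` is hypothetical and is not part of this repository.

```python
def uses_remote_code(obj: object) -> bool:
    """Heuristic check that `obj` was built from repository-supplied code.

    ASSUMPTION: Transformers imports trust_remote_code classes from a
    dynamically created top-level package named `transformers_modules`.
    """
    top_level_package = type(obj).__module__.split(".")[0]
    return top_level_package == "transformers_modules"
```

With the loading snippet above, `uses_remote_code(model)` should return True. If it returns False, the forward pass is probably not going through the ternary fake quantizer, and the loading arguments deserve a second look.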
Post-Fine-tuning Conversion
After fine-tuning is complete, use the qat-convert.py script to fuse the fake quantizer and produce the pseudo-quantized model weights that can be used for inference:
python qat-convert.py \
--input_bin <path-to-finetuned-pytorch.bin> \
--output <path-to-output-pseudo-quantized-pytorch.bin> \
--quant_type ternary \
--group_size -1
The converted model can then be loaded for inference in the same way as openbmb/BitCPM4-CANN-0.5B; no special quantization libraries are required.
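One way to sanity-check a converted checkpoint is to confirm that each weight group only takes values close to -s, 0, or +s for a single scale s. The sketch below does this for one group passed as a flat sequence of floats. It is an illustration under assumptions, not part of qat-convert.py: the helper name `is_pseudo_ternary` is hypothetical, the scale is inferred as the largest absolute value in the group, and the tolerance is a guess meant to absorb low-precision rounding. How `--group_size` partitions a tensor into groups is not specified here, so choosing what counts as one group is left to the reader.

```python
from typing import Iterable


def is_pseudo_ternary(values: Iterable[float], rel_tol: float = 1e-3) -> bool:
    """Return True if every value is approximately -s, 0, or +s.

    ASSUMPTION: the group shares one scale s, inferred here as the
    largest absolute value found in the group.
    """
    vals = [float(v) for v in values]
    scale = max((abs(v) for v in vals), default=0.0)
    if scale == 0.0:
        return True  # an empty or all-zero group trivially qualifies
    tol = rel_tol * scale
    # Each value must sit near 0 or near the shared magnitude.
    return all(min(abs(v), abs(abs(v) - scale)) <= tol for v in vals)
```

For a PyTorch tensor, the group could be passed as `tensor.float().flatten().tolist()`. Un-fused QAT parameters would normally fail this check, which makes it a quick way to tell the two checkpoint formats apart.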
Workflow Summary
┌─────────────────────────────────┐
│  BitCPM4-CANN-0.5B-unquantized  │  ← This model (QAT parameters + fake quantizer in modeling.py)
└───────────────┬─────────────────┘
                │
                ▼  Fine-tune (DeepSpeed / LLaMA Factory / ...)
┌─────────────────────────────────┐
│  Fine-tuned pytorch.bin         │  ← Still contains un-fused QAT parameters
└───────────────┬─────────────────┘
                │
                ▼  python qat-convert.py --quant_type ternary --group_size -1
┌─────────────────────────────────┐
│  Pseudo-quantized pytorch.bin   │  ← Ready for inference (same format as BitCPM4-CANN-0.5B)
└─────────────────────────────────┘
Technical Background
BitCPM4-CANN uses a ternary quantizer that maps each weight group to {-1, 0, 1} scaled by a group-wise factor. It is trained with a straight-through estimator (STE) so that gradients can flow through the non-differentiable rounding step. The unquantized checkpoint preserves the full-precision latent weights alongside the quantizer parameters, allowing the model to continue learning under quantization constraints during fine-tuning.
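The idea can be sketched in a few lines of plain Python. This is an illustration only, not the code in modeling.py. In particular, the scale rule (mean absolute value of the group, as used by some 1.58-bit schemes) is an assumption, and the function names `ternary_codes`, `fake_quantize`, and `ste_backward` are hypothetical.

```python
from typing import List, Tuple


def ternary_codes(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Map one weight group to codes in {-1, 0, 1} plus a shared scale.

    ASSUMPTION: the scale is the mean absolute value of the group;
    the actual rule in modeling.py may differ.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clamp into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def fake_quantize(weights: List[float]) -> List[float]:
    """Forward pass: replace each latent weight by scale * code."""
    codes, scale = ternary_codes(weights)
    return [c * scale for c in codes]


def ste_backward(grad_wrt_quantized: List[float]) -> List[float]:
    """Backward pass under STE: rounding is treated as the identity, so the
    gradient reaches the full-precision latent weights unchanged."""
    return list(grad_wrt_quantized)


latent = [0.9, -0.05, -1.2, 0.3]
print(ternary_codes(latent))
print(fake_quantize(latent))
```

The latent weights stay in full precision and receive the gradient updates, while the forward pass only ever sees the ternary values. This is why the fine-tuned checkpoint still needs the conversion step before inference.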
For full technical details, please refer to our Technical Report.
Statement
- As a language model, BitCPM4-CANN generates content by learning from a vast amount of text.
- However, it does not possess the ability to comprehend or express personal opinions or value judgments.
- Any content generated by BitCPM4-CANN does not represent the viewpoints or positions of the model developers.
- Therefore, when using content generated by BitCPM4-CANN, users should take full responsibility for evaluating and verifying it on their own.
LICENSE
- This repository and BitCPM4-CANN models are released under the Apache-2.0 License.
Citation
- Please cite our technical report if you find our work valuable.
@article{bitcpm4cann,
title={{BitCPM-CANN}: Native 1.58-Bit Large Language Model Training on Ascend NPU},
author={BitCPM Team},
year={2026}
}