GitHub Repo | Technical Report

👋 Join us on Discord and WeChat

Introduction

BitCPM4-CANN-1B-unquantized is the unquantized quantization-aware training (QAT) checkpoint of the BitCPM4-CANN-1B model. It stores the raw QAT parameters before fake-quantizer fusion: the ternary fake quantizers are defined in modeling.py and applied during the forward pass.

⚠️ This model is NOT intended for direct inference. It is designed as the starting point for fine-tuning BitCPM4-CANN. If you need a model for inference, please use the pseudo-quantized version: openbmb/BitCPM4-CANN-1B.

Key Characteristics

🎯 Purpose: Fine-tuning only. The model weights are un-fused QAT parameters with fake quantizers embedded in the modeling.py forward logic.
🔬 Ternary Fake Quantizer: The forward pass in modeling.py contains ternary quantization logic (mapping weights to {-1, 0, 1} with group-wise scaling), which ensures the model continues learning under ternary constraints during fine-tuning.
🔄 Post-Training Conversion: After fine-tuning, the model can be converted to pseudo-quantized format using the provided qat-convert.py script.
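
To make the ternary mapping above concrete, here is a minimal pure-Python sketch of group-wise ternary fake quantization. It is not the code in modeling.py. The function name `ternary_fake_quantize`, the mean-absolute-value scale, and the reading of `group_size=-1` as "one group for the whole input" are all assumptions chosen for illustration.

```python
def ternary_fake_quantize(weights, group_size=-1, eps=1e-8):
    """Return weights snapped to scale * {-1, 0, 1}, with one scale per group.

    Assumptions (not taken from modeling.py): the scale is the mean absolute
    value of the group, and group_size=-1 means a single group.
    """
    weights = list(weights)
    if not weights:
        return []
    if group_size == -1:
        group_size = len(weights)
    quantized = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Group-wise scale, floored at eps to avoid division by zero.
        scale = max(sum(abs(w) for w in group) / len(group), eps)
        for w in group:
            code = max(-1, min(1, round(w / scale)))  # ternary code in {-1, 0, 1}
            quantized.append(code * scale)            # "fake" quantized value
    return quantized


latent = [0.9, -0.05, -1.1, 0.4]
print(ternary_fake_quantize(latent))                # one scale for all four weights
print(ternary_fake_quantize(latent, group_size=2))  # one scale per pair
```

The output stays in floating point, which is why this is called fake quantization: each value is restricted to three levels per group, but no packed low-bit storage is involved.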

BitCPM4-CANN Model Family

Model	HuggingFace (Inference)	HuggingFace (Fine-tuning)
BitCPM4-CANN-0.5B	openbmb/BitCPM4-CANN-0.5B	openbmb/BitCPM4-CANN-0.5B-unquantized
BitCPM4-CANN-1B	openbmb/BitCPM4-CANN-1B	openbmb/BitCPM4-CANN-1B-unquantized
BitCPM4-CANN-3B	openbmb/BitCPM4-CANN-3B	openbmb/BitCPM4-CANN-3B-unquantized
BitCPM4-CANN-8B	openbmb/BitCPM4-CANN-8B	openbmb/BitCPM4-CANN-8B-unquantized

Usage

Fine-tuning

This model is designed for fine-tuning with frameworks that support custom modeling code. The critical requirement is that the forward pass must go through the modeling.py file bundled with this model, which contains the ternary fake quantizer logic. This ensures the model parameters remain compatible with ternary quantization constraints throughout fine-tuning.

Supported Fine-tuning Frameworks

DeepSpeed (recommended): See MiniCPM Fine-tuning Guide
LLaMA Factory: Supports custom model loading with trust_remote_code=True
Other Frameworks: Any framework that supports HuggingFace-compatible model loading with custom modeling code

Important: Ensure Fake Quantizer is Active

When fine-tuning, make sure that:

The model is loaded with trust_remote_code=True so that the custom modeling.py (containing the ternary quantizer) is used.
The forward pass during training goes through the ternary quantizer defined in modeling.py. Do NOT replace or bypass the model's forward logic.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = 'openbmb/BitCPM4-CANN-1B-unquantized'
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)

# Proceed with your fine-tuning pipeline (DeepSpeed, LLaMA Factory, etc.)
# The ternary fake quantizer in modeling.py will be applied automatically during the forward pass.

Post-Fine-tuning Conversion

After fine-tuning is complete, use the qat-convert.py script to fuse the fake quantizer and produce the pseudo-quantized model weights that can be used for inference:

python qat-convert.py \
    --input_bin <path-to-finetuned-pytorch.bin> \
    --output <path-to-output-pseudo-quantized-pytorch.bin> \
    --quant_type ternary \
    --group_size -1

The converted model can then be loaded for inference in the same way as openbmb/BitCPM4-CANN-1B; no special quantization libraries are required.
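
After conversion, a quick check that a weight group really holds only three levels can catch a run where the fake quantizer was bypassed. The sketch below is a hypothetical helper, not part of qat-convert.py. It assumes the values passed in share a single scale, so with a different grouping you would call it once per group, and it does not know which tensors the script quantizes.

```python
def is_ternary(values, rel_tol=1e-3):
    """Return True if every value is approximately -s, 0, or +s for one scale s.

    Assumption: all values passed in belong to the same quantization group.
    """
    vals = [float(v) for v in values]
    scale = max((abs(v) for v in vals), default=0.0)
    if scale == 0.0:
        return True  # all zeros (or empty) is trivially ternary
    # Each magnitude must sit near 0 or near the shared scale.
    return all(min(abs(v), abs(abs(v) - scale)) <= rel_tol * scale for v in vals)


print(is_ternary([0.5, 0.0, -0.5, 0.5]))  # three levels only
print(is_ternary([0.5, 0.2, -0.5]))       # 0.2 is a fourth level
```

For a PyTorch tensor `w` loaded from the converted checkpoint, `is_ternary(w.flatten().tolist())` applies the same check. Tensors that the script leaves in full precision are expected to fail it.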

Workflow Summary

┌─────────────────────────────────┐
│  BitCPM4-CANN-1B-unquantized    │   ← This model (QAT parameters + fake quantizer in modeling.py)
└───────────────┬─────────────────┘
                │
                ▼  Fine-tune (DeepSpeed / LLaMA Factory / ...)
┌─────────────────────────────────┐
│  Fine-tuned pytorch.bin         │   ← Still contains un-fused QAT parameters
└───────────────┬─────────────────┘
                │
                ▼  python qat-convert.py --quant_type ternary --group_size -1
┌─────────────────────────────────┐
│  Pseudo-quantized pytorch.bin   │   ← Ready for inference (same format as BitCPM4-CANN-1B)
└─────────────────────────────────┘

Technical Background

BitCPM4-CANN uses a ternary quantizer that maps each weight group to {-1, 0, 1} scaled by a group-wise factor, and it is trained with a straight-through estimator (STE) so that gradients can flow through the quantizer. The unquantized checkpoint preserves the full-precision latent weights alongside the quantizer parameters, allowing the model to continue learning under quantization constraints during fine-tuning.
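
In symbols, the description above can be sketched as follows. The notation is introduced here for illustration and is not taken from the Technical Report: $W_g$ is the latent full-precision weight group, $s_g$ its group-wise scale, $Q$ the ternary mapping, and $\mathcal{L}$ the training loss.

```latex
\begin{aligned}
\hat{W}_g &= s_g \, Q(W_g), \qquad Q(W_g) \in \{-1, 0, 1\}^{|g|}
  && \text{(forward pass uses the fake-quantized weights)} \\
\frac{\partial \mathcal{L}}{\partial W_g} &\approx \frac{\partial \mathcal{L}}{\partial \hat{W}_g}
  && \text{(STE: the quantizer is treated as the identity in the backward pass)}
\end{aligned}
```

Because $Q$ is piecewise constant, its true derivative is zero almost everywhere. The STE approximation is what lets updates reach the latent weights $W_g$ during fine-tuning.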

For full technical details, please refer to our Technical Report.

Statement

As a language model, BitCPM4-CANN generates content by learning from a vast amount of text.
However, it does not possess the ability to comprehend or express personal opinions or value judgments.
Any content generated by BitCPM4-CANN does not represent the viewpoints or positions of the model developers.
Therefore, when using content generated by BitCPM4-CANN, users should take full responsibility for evaluating and verifying it on their own.

LICENSE

This repository and BitCPM4-CANN models are released under the Apache-2.0 License.

Citation

Please cite our technical report if you find our work valuable.

@article{bitcpm4cann,
  title={{BitCPM-CANN}: Native 1.58-Bit Large Language Model Training on Ascend NPU},
  author={BitCPM Team},
  year={2026}
}
