guanwenyu1995 committed
Commit 8a5049d · verified · 1 Parent(s): 31610ea

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +110 -110
README.md CHANGED
@@ -1,131 +1,131 @@
1
- # BitCPM4 Continued Pretraining Example
2
 
3
- This project provides scripts for continued pretraining of **BitCPM4-CANN-1B-unquantized**.
4
 
5
- ## Environment Setup
6
 
7
- ### Docker Image
8
 
9
- Use the following Huawei NPU image:
10
 
11
- ```
12
- swr.cn-south-1.myhuaweicloud.com/ascendhub/mindspeed-llm:openeuler22.03-mindspeed-llm-2.3.0-a3-arm
13
- ```
14
 
15
- Other Huawei NPU images may also work but have not been fully tested.
16
 
17
- ### Install Dependencies
18
 
19
- After entering the container, install the Python dependencies:
20
 
21
  ```bash
22
- pip install -r requirements.txt
23
  ```
24
 
25
- Dependency list:
26
 
27
- | Package | Version |
28
- | --- | --- |
29
- | transformers | 4.46.3 |
30
- | tokenizers | 0.20.3 |
31
- | accelerate | 1.1.1 |
32
- | deepspeed | 0.16.2 |
33
- | datasets | 3.1.0 |
34
- | safetensors | 0.4.5 |
35
- | pyarrow | 17.0.0 |
36
- | tensorboard | 2.18.0 |
37
 
38
- ## Dataset
39
 
40
- The test dataset is [C4-Pro](https://huggingface.co/datasets/gair-prox/c4-pro), which is stored as Parquet files once downloaded.
41
 
42
- ## Usage
43
 
44
- Modify the path configuration in `run.sh`:
45
 
46
- ```bash
47
- MODEL_PATH="/path/to/BitCPM4-CANN-1B-unquantized/"
48
- DATA_PATH="/path/to/c4-pro/data/your_file.parquet"
49
- ```
50
 
51
- Then start training:
52
 
53
- ```bash
54
- bash run.sh
55
- ```
56
 
57
- By default, the script trains for 500 steps using 8 devices, DeepSpeed ZeRO-2, and bf16 precision.
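
For orientation, the defaults named above (ZeRO-2 and bf16) map onto a DeepSpeed JSON config roughly as follows. This is a minimal sketch and not the contents of `ds_config_z2.json`: the variable name `ds_config_sketch` is hypothetical, and the `"auto"` values assume the HuggingFace Trainer integration, which fills them in from its own arguments.

```python
import json

# Hypothetical minimal ZeRO-2 + bf16 config; the repository's real
# ds_config_z2.json may set additional or different fields.
ds_config_sketch = {
    "bf16": {"enabled": True},             # bf16 mixed precision
    "zero_optimization": {"stage": 2},     # ZeRO stage 2: shard optimizer state and gradients
    # "auto" lets the HuggingFace Trainer substitute its own values.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

print(json.dumps(ds_config_sketch, indent=2))
```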
58
-
59
- ## Training Results Reference
60
-
61
- Below are the logged loss values for roughly the first 100 steps (the table runs through step 102; learning rate warmup covers the first 50 steps):
62
-
63
- | Step | Loss | Learning Rate | Epoch |
64
- | --- | --- | --- | --- |
65
- | 2 | 2.7920 | 1.60e-06 | 0.01 |
66
- | 4 | 2.8012 | 3.20e-06 | 0.02 |
67
- | 6 | 2.7984 | 4.80e-06 | 0.03 |
68
- | 8 | 2.7839 | 6.40e-06 | 0.04 |
69
- | 10 | 2.8084 | 8.00e-06 | 0.05 |
70
- | 12 | 2.8064 | 9.60e-06 | 0.06 |
71
- | 14 | 2.7994 | 1.12e-05 | 0.07 |
72
- | 16 | 2.7463 | 1.28e-05 | 0.08 |
73
- | 18 | 2.7580 | 1.44e-05 | 0.09 |
74
- | 20 | 2.8007 | 1.60e-05 | 0.10 |
75
- | 22 | 2.8916 | 1.76e-05 | 0.12 |
76
- | 24 | 2.8144 | 1.92e-05 | 0.13 |
77
- | 26 | 2.7723 | 2.08e-05 | 0.14 |
78
- | 28 | 2.7556 | 2.24e-05 | 0.15 |
79
- | 30 | 2.7414 | 2.40e-05 | 0.16 |
80
- | 32 | 2.7469 | 2.56e-05 | 0.17 |
81
- | 34 | 2.7428 | 2.72e-05 | 0.18 |
82
- | 36 | 2.7392 | 2.88e-05 | 0.19 |
83
- | 38 | 2.7132 | 3.04e-05 | 0.20 |
84
- | 40 | 2.7008 | 3.20e-05 | 0.21 |
85
- | 42 | 2.7547 | 3.36e-05 | 0.22 |
86
- | 44 | 2.7151 | 3.52e-05 | 0.23 |
87
- | 46 | 2.7119 | 3.68e-05 | 0.24 |
88
- | 48 | 2.7029 | 3.84e-05 | 0.25 |
89
- | 50 | 2.6803 | 4.00e-05 | 0.26 |
90
- | 52 | 2.6980 | 4.00e-05 | 0.27 |
91
- | 54 | 2.6923 | 4.00e-05 | 0.28 |
92
- | 56 | 2.7068 | 4.00e-05 | 0.29 |
93
- | 58 | 2.6965 | 4.00e-05 | 0.30 |
94
- | 60 | 2.7179 | 3.99e-05 | 0.31 |
95
- | 62 | 2.7119 | 3.99e-05 | 0.32 |
96
- | 64 | 2.7178 | 3.99e-05 | 0.33 |
97
- | 66 | 2.7069 | 3.99e-05 | 0.35 |
98
- | 68 | 2.6870 | 3.98e-05 | 0.36 |
99
- | 70 | 2.6775 | 3.98e-05 | 0.37 |
100
- | 72 | 2.7038 | 3.98e-05 | 0.38 |
101
- | 74 | 2.6924 | 3.97e-05 | 0.39 |
102
- | 76 | 2.7061 | 3.97e-05 | 0.40 |
103
- | 78 | 2.6929 | 3.96e-05 | 0.41 |
104
- | 80 | 2.6787 | 3.96e-05 | 0.42 |
105
- | 82 | 2.6749 | 3.95e-05 | 0.43 |
106
- | 84 | 2.6909 | 3.94e-05 | 0.44 |
107
- | 86 | 2.6893 | 3.94e-05 | 0.45 |
108
- | 88 | 2.6788 | 3.93e-05 | 0.46 |
109
- | 90 | 2.6831 | 3.92e-05 | 0.47 |
110
- | 92 | 2.7039 | 3.91e-05 | 0.48 |
111
- | 94 | 2.6619 | 3.91e-05 | 0.49 |
112
- | 96 | 2.6903 | 3.90e-05 | 0.50 |
113
- | 98 | 2.6993 | 3.89e-05 | 0.51 |
114
- | 100 | 2.6891 | 3.88e-05 | 0.52 |
115
- | 102 | 2.6739 | 3.87e-05 | 0.53 |
116
-
117
- > **Note:** BitCPM was trained on its own dataset and data mixture, so the loss is expected to keep decreasing during continued pretraining on open-source datasets.
118
-
119
- As shown in the table, the loss gradually decreases from ~2.79 to ~2.67, indicating a stable training process and that the model is learning normally.
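
The learning-rate column for the warmup phase is consistent with a linear ramp from 0 to 4.00e-05 over 50 steps. The sketch below reproduces those values; the function name `warmup_lr` is hypothetical, and the decay after step 50 is deliberately not modeled because its exact shape is not stated here.

```python
PEAK_LR = 4.0e-5    # learning rate logged at step 50 in the table
WARMUP_STEPS = 50   # warmup length stated in the text


def warmup_lr(step: int) -> float:
    """Linear warmup from 0 to PEAK_LR; holds the peak afterwards (decay not modeled)."""
    return PEAK_LR * min(step, WARMUP_STEPS) / WARMUP_STEPS


for step in (2, 10, 20, 50):
    print(step, f"{warmup_lr(step):.2e}")
# 2 1.60e-06
# 10 8.00e-06
# 20 1.60e-05
# 50 4.00e-05
```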
120
-
121
- ## File Description
122
-
123
- | File | Description |
124
- | --- | --- |
125
- | `train.py` | Training script based on HuggingFace Trainer + DeepSpeed |
126
- | `run.sh` | Launch script with training hyperparameter configuration |
127
- | `train_sft.py` | Supervised fine-tuning script based on HuggingFace Trainer + DeepSpeed |
128
- | `run_sft.sh` | Launch script for SFT with hyperparameter configuration |
129
- | `ds_config.json` | DeepSpeed ZeRO-3 configuration (with CPU offload) |
130
- | `ds_config_z2.json` | DeepSpeed ZeRO-2 configuration (used by default) |
131
- | `requirements.txt` | Python dependency list |
 
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - zh
5
+ - en
6
+ pipeline_tag: text-generation
7
+ library_name: transformers
8
+ ---
9
+ <div align="center">
10
+ <img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em" ></img>
11
+ </div>
12
+
13
+ <p align="center">
14
+ <a href="https://github.com/OpenBMB/MiniCPM/" target="_blank">GitHub Repo</a> |
15
+ <a href="TODO_TECHNICAL_REPORT_LINK" target="_blank">Technical Report</a>
16
+ </p>
17
+ <p align="center">
18
+ 👋 Join us on <a href="https://discord.gg/3cGQn9b3YM" target="_blank">Discord</a> and <a href="https://github.com/OpenBMB/MiniCPM/blob/main/assets/wechat.jpg" target="_blank">WeChat</a>
19
+ </p>
20
+
21
+ ## Introduction
22
+
23
+ BitCPM4-CANN-1B-unquantized is the **unquantized QAT training checkpoint** of the BitCPM4-CANN-1B model. This model stores the raw quantization-aware training (QAT) parameters **before** fake-quantizer fusion; the ternary fake quantizers are defined in `modeling.py` and applied during forward propagation.
24
+
25
+ > ⚠️ **This model is NOT intended for direct inference.** It is designed as the starting point for fine-tuning BitCPM4-CANN. If you need a model for inference, please use the pseudo-quantized version: [openbmb/BitCPM4-CANN-1B](https://huggingface.co/openbmb/BitCPM4-CANN-1B).
26
+
27
+ ### Key Characteristics
28
+
29
+ - 🎯 **Purpose**: Fine-tuning only. The model weights are un-fused QAT parameters with fake quantizers embedded in the `modeling.py` forward logic.
30
+ - 🔬 **Ternary Fake Quantizer**: The forward pass in `modeling.py` contains ternary quantization logic (mapping weights to {-1, 0, 1} with group-wise scaling), which ensures the model continues learning under ternary constraints during fine-tuning.
31
+ - 🔄 **Post-Training Conversion**: After fine-tuning, the model can be converted to pseudo-quantized format using the provided `qat-convert.py` script.
32
+
33
+ ## BitCPM4-CANN Model Family
34
+
35
+ | Model | HuggingFace (Inference) | HuggingFace (Fine-tuning) |
36
+ |-------|-------------------------|---------------------------|
37
+ | BitCPM4-CANN-0.5B | [openbmb/BitCPM4-CANN-0.5B](https://huggingface.co/openbmb/BitCPM4-CANN-0.5B) | [openbmb/BitCPM4-CANN-0.5B-unquantized](https://huggingface.co/openbmb/BitCPM4-CANN-0.5B-unquantized) |
38
+ | BitCPM4-CANN-1B | [openbmb/BitCPM4-CANN-1B](https://huggingface.co/openbmb/BitCPM4-CANN-1B) | [openbmb/BitCPM4-CANN-1B-unquantized](https://huggingface.co/openbmb/BitCPM4-CANN-1B-unquantized) |
39
+ | BitCPM4-CANN-3B | [openbmb/BitCPM4-CANN-3B](https://huggingface.co/openbmb/BitCPM4-CANN-3B) | [openbmb/BitCPM4-CANN-3B-unquantized](https://huggingface.co/openbmb/BitCPM4-CANN-3B-unquantized) |
40
+ | BitCPM4-CANN-8B | [openbmb/BitCPM4-CANN-8B](https://huggingface.co/openbmb/BitCPM4-CANN-8B) | [openbmb/BitCPM4-CANN-8B-unquantized](https://huggingface.co/openbmb/BitCPM4-CANN-8B-unquantized) |
41
 
42
+ ## Usage
43
 
44
+ ### Fine-tuning
45
 
46
+ This model is designed for fine-tuning with frameworks that support custom modeling code. The critical requirement is that **the forward pass must go through the `modeling.py` file bundled with this model**, which contains the ternary fake quantizer logic. This ensures the model parameters remain compatible with ternary quantization constraints throughout fine-tuning.
47
 
48
+ #### Supported Fine-tuning Frameworks
49
 
50
+ - **DeepSpeed** (recommended): See [example](./example)
51
+ - **LLaMA Factory**: Supports custom model loading with `trust_remote_code=True`
52
+ - **Other Frameworks**: Any framework that supports HuggingFace-compatible model loading with custom modeling code
53
+
54
+ #### Important: Ensure Fake Quantizer is Active
55
 
56
+ When fine-tuning, you **must**:
57
+ 1. Load the model with `trust_remote_code=True` so that the custom `modeling.py` (containing the ternary quantizer) is used.
58
+ 2. Keep the training forward pass going through the ternary quantizer defined in `modeling.py`; do NOT replace or bypass the model's forward logic.
59
+
60
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ path = 'openbmb/BitCPM4-CANN-1B-unquantized'
+ tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     path,
+     torch_dtype=torch.bfloat16,
+     trust_remote_code=True,  # required so the bundled modeling.py is used
+ )
+
+ # Proceed with your fine-tuning pipeline (DeepSpeed, LLaMA Factory, etc.)
+ # The ternary fake quantizer in modeling.py is applied automatically during the forward pass.
73
+ ```
74
 
75
+ ### Post-Fine-tuning Conversion
76
 
77
+ After fine-tuning is complete, use the `qat-convert.py` script to fuse the fake quantizer and produce the pseudo-quantized model weights that can be used for inference:
78
 
79
  ```bash
+ python qat-convert.py \
+     --input_bin <path-to-finetuned-pytorch.bin> \
+     --output <path-to-output-pseudo-quantized-pytorch.bin> \
+     --quant_type ternary \
+     --group_size -1
85
  ```
86
 
87
+ The converted model can then be loaded for inference in the same way as [openbmb/BitCPM4-CANN-1B](https://huggingface.co/openbmb/BitCPM4-CANN-1B), with no special quantization libraries required.
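
Conceptually, fusing the fake quantizer means storing the already quantized-and-rescaled weights, so that inference needs no quantizer in the forward pass. The toy sketch below illustrates that idea only; it is not the logic of `qat-convert.py`. The function name `fuse_ternary`, the absmean scale, and treating each tensor as a single group are all assumptions made for illustration.

```python
from typing import Dict, List


def fuse_ternary(state: Dict[str, List[float]], eps: float = 1e-8) -> Dict[str, List[float]]:
    """Return a new state dict whose weights are ternary codes times a scale."""
    fused = {}
    for name, weights in state.items():
        # Assumed scale: mean absolute value of the whole tensor (one group).
        scale = sum(abs(w) for w in weights) / len(weights) + eps
        # Round to the nearest integer, clamp to {-1, 0, 1}, then rescale.
        fused[name] = [max(-1, min(1, round(w / scale))) * scale for w in weights]
    return fused


toy_state = {"layer.weight": [0.9, -0.05, -1.2, 0.3]}
fused = fuse_ternary(toy_state)
magnitudes = {round(abs(v), 6) for v in fused["layer.weight"]}
print(len(magnitudes) <= 2)  # True: every fused weight is 0 or +/- the scale
```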
88
 
89
+ ## Workflow Summary
90
 
91
+ ```
+ ┌─────────────────────────────────┐
+ │   BitCPM4-CANN-1B-unquantized   │  ← This model (QAT parameters + fake quantizer in modeling.py)
+ └───────────────┬─────────────────┘
+                 │
+                 ▼  Fine-tune (DeepSpeed / LLaMA Factory / ...)
+ ┌─────────────────────────────────┐
+ │     Fine-tuned pytorch.bin      │  ← Still contains un-fused QAT parameters
+ └───────────────┬─────────────────┘
+                 │
+                 ▼  python qat-convert.py --quant_type ternary --group_size -1
+ ┌─────────────────────────────────┐
+ │  Pseudo-quantized pytorch.bin   │  ← Ready for inference (same format as BitCPM4-CANN-0.5B)
+ └─────────────────────────────────┘
105
+ ```
106
 
107
+ ## Technical Background
108
 
109
+ BitCPM4-CANN uses a ternary quantizer that maps each weight group to {-1, 0, 1} scaled by a group-wise factor, trained with Straight-Through Estimator (STE) for gradient flow. The unquantized checkpoint preserves the full-precision latent weights alongside the quantizer parameters, allowing the model to continue learning under quantization constraints during fine-tuning.
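
The paragraph above can be made concrete with a small, dependency-free sketch of a ternary fake quantizer and its straight-through gradient. This is an illustration under stated assumptions, not the code in `modeling.py`: the names `ternary_fake_quant` and `ste_backward` are hypothetical, and the absmean scale is one common choice that the actual quantizer may not use.

```python
from typing import List, Tuple


def ternary_fake_quant(group: List[float], eps: float = 1e-8) -> Tuple[List[int], float, List[float]]:
    """Fake-quantize one weight group to {-1, 0, 1} times a shared scale.

    Assumption: the scale is the group's mean absolute value (absmean).
    """
    scale = sum(abs(w) for w in group) / len(group) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in group]  # round, then clamp
    dequant = [c * scale for c in codes]                        # values used in the forward pass
    return codes, scale, dequant


def ste_backward(grad_output: List[float]) -> List[float]:
    """Straight-through estimator: round/clamp is treated as the identity in
    the backward pass, so the latent weights receive the gradient unchanged."""
    return list(grad_output)


latent = [0.9, -0.05, -1.2, 0.3]   # full-precision latent weights
codes, scale, dequant = ternary_fake_quant(latent)
print(codes)                        # [1, 0, -1, 0]
print(ste_backward([0.1, -0.2, 0.3, 0.0]))
```

The forward pass sees only `dequant`, while the optimizer keeps updating `latent`; that split is what lets fine-tuning continue under the ternary constraint.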
110
 
111
+ For full technical details, please refer to our [Technical Report](TODO_TECHNICAL_REPORT_LINK).
112
 
113
+ ## Statement
114
+ - As a language model, BitCPM4-CANN generates content by learning from a vast amount of text.
115
+ - However, it does not possess the ability to comprehend or express personal opinions or value judgments.
116
+ - Any content generated by BitCPM4-CANN does not represent the viewpoints or positions of the model developers.
117
+ - Therefore, when using content generated by BitCPM4-CANN, users should take full responsibility for evaluating and verifying it on their own.
118
 
119
+ ## LICENSE
120
+ - This repository and BitCPM4-CANN models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
121
 
122
+ ## Citation
123
+ - Please cite our technical report if you find our work valuable.
124
 
125
+ ```bibtex
126
+ @article{bitcpm4cann,
127
+ title={{BitCPM-CANN}: Native 1.58-Bit Large Language Model Training on Ascend NPU},
128
+ author={BitCPM Team},
129
+ year={2026}
130
+ }
131
+ ```