guanwenyu1995 committed
Commit 413f4f3 · verified · 1 Parent(s): cc733b5

Upload folder using huggingface_hub

example/README.md CHANGED
@@ -1,6 +1,6 @@
- # BitCPM4 Continue Pretrain Example
+ # BitCPM4 Training Example
 
- This project provides scripts for continue pretraining **BitCPM4-CANN-1B-unquantized**.
+ This project provides scripts for continue pretraining (CPT) and supervised fine-tuning (SFT) of **BitCPM4-CANN-1B-unquantized**.
 
  ## Environment Setup
 
@@ -35,11 +35,13 @@ Dependency list:
  | pyarrow | 17.0.0 |
  | tensorboard | 2.18.0 |
 
- ## Dataset
+ ## Continue Pretrain (CPT)
+
+ ### Dataset
 
  The test dataset used is [C4-Pro](https://huggingface.co/datasets/gair-prox/c4-pro), stored in parquet format after downloading.
 
- ## Usage
+ ### Usage
 
  Modify the path configuration in `run.sh`:
 
@@ -54,76 +56,55 @@ Then start training:
  bash run.sh
  ```
 
- By default, the script trains for 500 steps using 8 devices, DeepSpeed ZeRO-2, and bf16 precision.
+ By default, the script trains for 100 steps using 8 devices, DeepSpeed ZeRO-2, and bf16 precision.
+
+ ## Supervised Fine-Tuning (SFT)
+
+ ### Dataset
+
+ The test dataset used is [UltraChat 200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), stored in parquet format after downloading.
+
+ ### Usage
+
+ Modify the path configuration in `run_sft.sh`:
+
+ ```bash
+ MODEL_PATH="/path/to/BitCPM4-CANN-1B-unquantized/"
+ DATA_PATH="/path/to/ultrachat_200k/data/your_file.parquet"
+ ```
+
+ Then start training:
+
+ ```bash
+ bash run_sft.sh
+ ```
+
+ By default, the script trains for 100 steps using 8 devices, DeepSpeed ZeRO-3 (with CPU offload), and bf16 precision. The maximum sequence length is 8192.
 
  ## Training Results Reference
 
- Below is the loss curve for the first 100 steps (learning rate warmup covers the first 50 steps):
-
- | Step | Loss | Learning Rate | Epoch |
- | --- | --- | --- | --- |
- | 2 | 2.7920 | 1.60e-06 | 0.01 |
- | 4 | 2.8012 | 3.20e-06 | 0.02 |
- | 6 | 2.7984 | 4.80e-06 | 0.03 |
- | 8 | 2.7839 | 6.40e-06 | 0.04 |
- | 10 | 2.8084 | 8.00e-06 | 0.05 |
- | 12 | 2.8064 | 9.60e-06 | 0.06 |
- | 14 | 2.7994 | 1.12e-05 | 0.07 |
- | 16 | 2.7463 | 1.28e-05 | 0.08 |
- | 18 | 2.7580 | 1.44e-05 | 0.09 |
- | 20 | 2.8007 | 1.60e-05 | 0.10 |
- | 22 | 2.8916 | 1.76e-05 | 0.12 |
- | 24 | 2.8144 | 1.92e-05 | 0.13 |
- | 26 | 2.7723 | 2.08e-05 | 0.14 |
- | 28 | 2.7556 | 2.24e-05 | 0.15 |
- | 30 | 2.7414 | 2.40e-05 | 0.16 |
- | 32 | 2.7469 | 2.56e-05 | 0.17 |
- | 34 | 2.7428 | 2.72e-05 | 0.18 |
- | 36 | 2.7392 | 2.88e-05 | 0.19 |
- | 38 | 2.7132 | 3.04e-05 | 0.20 |
- | 40 | 2.7008 | 3.20e-05 | 0.21 |
- | 42 | 2.7547 | 3.36e-05 | 0.22 |
- | 44 | 2.7151 | 3.52e-05 | 0.23 |
- | 46 | 2.7119 | 3.68e-05 | 0.24 |
- | 48 | 2.7029 | 3.84e-05 | 0.25 |
- | 50 | 2.6803 | 4.00e-05 | 0.26 |
- | 52 | 2.6980 | 4.00e-05 | 0.27 |
- | 54 | 2.6923 | 4.00e-05 | 0.28 |
- | 56 | 2.7068 | 4.00e-05 | 0.29 |
- | 58 | 2.6965 | 4.00e-05 | 0.30 |
- | 60 | 2.7179 | 3.99e-05 | 0.31 |
- | 62 | 2.7119 | 3.99e-05 | 0.32 |
- | 64 | 2.7178 | 3.99e-05 | 0.33 |
- | 66 | 2.7069 | 3.99e-05 | 0.35 |
- | 68 | 2.6870 | 3.98e-05 | 0.36 |
- | 70 | 2.6775 | 3.98e-05 | 0.37 |
- | 72 | 2.7038 | 3.98e-05 | 0.38 |
- | 74 | 2.6924 | 3.97e-05 | 0.39 |
- | 76 | 2.7061 | 3.97e-05 | 0.40 |
- | 78 | 2.6929 | 3.96e-05 | 0.41 |
- | 80 | 2.6787 | 3.96e-05 | 0.42 |
- | 82 | 2.6749 | 3.95e-05 | 0.43 |
- | 84 | 2.6909 | 3.94e-05 | 0.44 |
- | 86 | 2.6893 | 3.94e-05 | 0.45 |
- | 88 | 2.6788 | 3.93e-05 | 0.46 |
- | 90 | 2.6831 | 3.92e-05 | 0.47 |
- | 92 | 2.7039 | 3.91e-05 | 0.48 |
- | 94 | 2.6619 | 3.91e-05 | 0.49 |
- | 96 | 2.6903 | 3.90e-05 | 0.50 |
- | 98 | 2.6993 | 3.89e-05 | 0.51 |
- | 100 | 2.6891 | 3.88e-05 | 0.52 |
- | 102 | 2.6739 | 3.87e-05 | 0.53 |
-
- > **Note:** BitCPM has its own training dataset and data mixture. It is expected that the loss continues to decrease when continue pretraining on open-source datasets.
-
- As shown in the table, the loss gradually decreases from ~2.79 to ~2.67, indicating a stable training process and that the model is learning normally.
+ Below are the loss curves from smoke tests on GPU and NPU for both CPT and SFT tasks:
+
+ | | GPU | NPU |
+ | --- | --- | --- |
+ | **CPT** | ![GPU Pretrain Loss](gpu_pretrain_loss.png) | ![NPU Pretrain Loss](npu_pretrain_loss.png) |
+ | **SFT** | ![GPU SFT Loss](gpu_sft_loss.png) | ![NPU SFT Loss](npu_sft_loss.png) |
+
+ Training log CSV files:
+
+ - [gpu_pretrain.csv](gpu_pretrain.csv)
+ - [npu_pretrain.csv](npu_pretrain.csv)
+ - [gpu_sft.csv](gpu_sft.csv)
+ - [npu_sft.csv](npu_sft.csv)
+
+ > **Note:** BitCPM has its own training dataset and data mixture. It is expected that the loss continues to decrease when training on open-source datasets.
 
  ## File Description
 
  | File | Description |
  | --- | --- |
- | `train.py` | Training script based on HuggingFace Trainer + DeepSpeed |
- | `run.sh` | Launch script with training hyperparameter configuration |
+ | `train.py` | Continue pretrain script based on HuggingFace Trainer + DeepSpeed |
+ | `run.sh` | Launch script for CPT with hyperparameter configuration |
  | `train_sft.py` | Supervised fine-tuning script based on HuggingFace Trainer + DeepSpeed |
  | `run_sft.sh` | Launch script for SFT with hyperparameter configuration |
  | `ds_config.json` | DeepSpeed ZeRO-3 configuration (with CPU offload) |
example/gpu_pretrain.csv ADDED
@@ -0,0 +1,51 @@
1
+ step,train/loss,train/grad_norm,train/learning_rate,train/epoch,train/train_runtime,train/train_samples_per_second,train/train_steps_per_second,train/total_flos,train/train_loss
2
+ 2,2.7920000553131104,0.03527498617768288,7.999999979801942e-06,0.010457516647875309,,,,,
3
+ 4,2.8011999130249023,0.03495891019701958,1.5999999959603883e-05,0.020915033295750618,,,,,
4
+ 6,2.7964000701904297,0.03271934762597084,2.4000000848900527e-05,0.0313725508749485,,,,,
5
+ 8,2.763700008392334,0.024968057870864868,3.199999991920777e-05,0.041830066591501236,,,,,
6
+ 10,3.281599998474121,0.31758183240890503,3.9999998989515007e-05,0.05228758230805397,,,,,
7
+ 12,2.941200017929077,0.044055406004190445,3.995128281530924e-05,0.062745101749897,,,,,
8
+ 14,2.851799964904785,0.03649706766009331,3.9805359847377986e-05,0.07320261746644974,,,,,
9
+ 16,2.7869999408721924,0.022624235600233078,3.9562950405525044e-05,0.08366013318300247,,,,,
10
+ 18,2.7825000286102295,0.021830420941114426,3.922523319488391e-05,0.0941176488995552,,,,,
11
+ 20,2.7857000827789307,0.01685911975800991,3.87938525818754e-05,0.10457516461610794,,,,,
12
+ 22,2.7571001052856445,0.01572061888873577,3.827090768027119e-05,0.11503268033266068,,,,,
13
+ 24,2.762399911880493,0.016891509294509888,3.7658952351193875e-05,0.125490203499794,,,,,
14
+ 26,2.7411000728607178,0.015683824196457863,3.6960962461307645e-05,0.13594771921634674,,,,,
15
+ 28,2.733099937438965,0.012847283855080605,3.6180339520797133e-05,0.14640523493289948,,,,,
16
+ 30,2.723400115966797,0.015209181234240532,3.532088885549456e-05,0.1568627506494522,,,,,
17
+ 32,2.7342000007629395,0.01241038367152214,3.4386797779006884e-05,0.16732026636600494,,,,,
18
+ 34,2.7321999073028564,0.012879018671810627,3.338261376484297e-05,0.17777778208255768,,,,,
19
+ 36,2.7314000129699707,0.013242729939520359,3.231322989449836e-05,0.1882352977991104,,,,,
20
+ 38,2.7065999507904053,0.01113435160368681,3.118385939160362e-05,0.19869281351566315,,,,,
21
+ 40,2.6958999633789062,0.012413726188242435,2.9999999242136255e-05,0.20915032923221588,,,,,
22
+ 42,2.7516000270843506,0.011661508120596409,2.8767422918463126e-05,0.21960784494876862,,,,,
23
+ 44,2.713099956512451,0.012248368933796883,2.749213126662653e-05,0.23006536066532135,,,,,
24
+ 46,2.7102999687194824,0.011450185440480709,2.6180339773418382e-05,0.24052287638187408,,,,,
25
+ 48,2.7021000385284424,0.011155751533806324,2.483843854861334e-05,0.250980406999588,,,,,
26
+ 50,2.680500030517578,0.010021247901022434,2.3472963221138343e-05,0.26143792271614075,,,,,
27
+ 52,2.699199914932251,0.010751751251518726,2.2090569473220967e-05,0.2718954384326935,,,,,
28
+ 54,2.694200038909912,0.010503941215574741,2.0697989384643734e-05,0.2823529541492462,,,,,
29
+ 56,2.7091000080108643,0.010059370659291744,1.9302009604871273e-05,0.29281046986579895,,,,,
30
+ 58,2.699399948120117,0.012161476537585258,1.7909431335283443e-05,0.3032679855823517,,,,,
31
+ 60,2.7216999530792236,0.010671027936041355,1.6527035768376663e-05,0.3137255012989044,,,,,
32
+ 62,2.7158000469207764,0.010463157668709755,1.516156225989107e-05,0.32418301701545715,,,,,
33
+ 64,2.7214999198913574,0.010665320791304111,1.3819660125591327e-05,0.3346405327320099,,,,,
34
+ 66,2.7116000652313232,0.01046629250049591,1.2507867722888477e-05,0.3450980484485626,,,,,
35
+ 68,2.6923000812530518,0.010609752498567104,1.1232576980546582e-05,0.35555556416511536,,,,,
36
+ 70,2.6830999851226807,0.009290814399719238,9.999999747378752e-06,0.3660130798816681,,,,,
37
+ 72,2.7093000411987305,0.010727670043706894,8.816142326395493e-06,0.3764705955982208,,,,,
38
+ 74,2.698699951171875,0.0109737953171134,7.686770914006047e-06,0.38692811131477356,,,,,
39
+ 76,2.712599992752075,0.010320967063307762,6.61738795315614e-06,0.3973856270313263,,,,,
40
+ 78,2.6993000507354736,0.009841523133218288,5.613203938992228e-06,0.40784314274787903,,,,,
41
+ 80,2.6861000061035156,0.010179675184190273,4.6791110435151495e-06,0.41830065846443176,,,,,
42
+ 82,2.6828999519348145,0.009790077805519104,3.819659923465224e-06,0.4287581741809845,,,,,
43
+ 84,2.699199914932251,0.010508442297577858,3.03903811982309e-06,0.43921568989753723,,,,,
44
+ 86,2.6988000869750977,0.009589221328496933,2.3410482299368596e-06,0.44967320561408997,,,,,
45
+ 88,2.688499927520752,0.010065913200378418,1.7290908544964623e-06,0.4601307213306427,,,,,
46
+ 90,2.6928999423980713,0.010363687761127949,1.206147544507985e-06,0.47058823704719543,,,,,
47
+ 92,2.714200019836426,0.010142815299332142,7.74766078848188e-07,0.48104575276374817,,,,,
48
+ 94,2.672300100326538,0.009833029471337795,4.370479871340649e-07,0.4915032684803009,,,,,
49
+ 96,2.7018001079559326,0.009937037713825703,1.9463863054625108e-07,0.501960813999176,,,,,
50
+ 98,2.7121999263763428,0.009417451918125153,4.8718995060426096e-08,0.5124183297157288,,,,,
51
+ 100,2.7028000354766846,0.009256146848201752,0.0,0.5228758454322815,365.8839111328125,139.93499755859375,0.27300000190734863,4.629706395531346e+17,2.7395541667938232
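Review note: the log above has one row per logging step, and only the final row fills in the run-level columns (`train/train_runtime` and the columns after it). The following is a minimal sketch of how such a file could be parsed and summarized, assuming the column names in the header above. The helper names `load_log` and `summarize` are hypothetical and not part of this commit, and the three embedded rows are copied from the file so the snippet runs without it.

```python
import csv
import io

# Header plus three rows copied from gpu_pretrain.csv above.
# In practice, read the real file with open("gpu_pretrain.csv") instead.
SAMPLE = """step,train/loss,train/grad_norm,train/learning_rate,train/epoch,train/train_runtime,train/train_samples_per_second,train/train_steps_per_second,train/total_flos,train/train_loss
2,2.7920000553131104,0.03527498617768288,7.999999979801942e-06,0.010457516647875309,,,,,
10,3.281599998474121,0.31758183240890503,3.9999998989515007e-05,0.05228758230805397,,,,,
100,2.7028000354766846,0.009256146848201752,0.0,0.5228758454322815,365.8839111328125,139.93499755859375,0.27300000190734863,4.629706395531346e+17,2.7395541667938232
"""


def load_log(text):
    """Parse a training-log CSV into dicts; empty cells become None."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({key: (float(value) if value else None) for key, value in raw.items()})
    return rows


def summarize(rows):
    """Return the step with the highest loss plus the final-row metrics."""
    peak = max(rows, key=lambda row: row["train/loss"])
    last = rows[-1]
    return {
        "peak_step": int(peak["step"]),
        "peak_loss": peak["train/loss"],
        "final_loss": last["train/loss"],
        "runtime_s": last["train/train_runtime"],
    }


rows = load_log(SAMPLE)
summary = summarize(rows)
print(summary)
```

On the full file, the highest logged loss is at step 10, the same row where the logged learning rate reaches its 4e-5 peak.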
example/gpu_pretrain_loss.png ADDED
example/gpu_sft.csv ADDED
@@ -0,0 +1,51 @@
1
+ step,train/loss,train/grad_norm,train/learning_rate,train/epoch,train/train_runtime,train/train_samples_per_second,train/train_steps_per_second,train/total_flos,train/train_loss
2
+ 2,1.1492999792099,0.6216375231742859,1.9999999949504854e-06,0.0004617871018126607,,,,,
3
+ 4,1.0979000329971313,0.681877851486206,3.999999989900971e-06,0.0009235742036253214,,,,,
4
+ 6,1.1269999742507935,0.784303605556488,6.000000212225132e-06,0.001385361305437982,,,,,
5
+ 8,1.0542000532150269,0.8737029433250427,7.999999979801942e-06,0.0018471484072506428,,,,,
6
+ 10,1.2440999746322632,0.7068291902542114,9.999999747378752e-06,0.0023089356254786253,,,,,
7
+ 12,1.2925000190734863,0.6821666955947876,1.2000000424450263e-05,0.002770722610875964,,,,,
8
+ 14,1.0843000411987305,0.525643527507782,1.4000000192027073e-05,0.0032325098291039467,,,,,
9
+ 16,1.0961999893188477,0.43757057189941406,1.5999999959603883e-05,0.0036942968145012856,,,,,
10
+ 18,1.0614999532699585,0.46141618490219116,1.8000000636675395e-05,0.004156084265559912,,,,,
11
+ 20,1.332900047302246,0.715879499912262,1.9999999494757503e-05,0.004617871250957251,,,,,
12
+ 22,1.2070000171661377,0.5926885008811951,1.996917308133561e-05,0.0050796582363545895,,,,,
13
+ 24,1.2043999433517456,0.5833240747451782,1.9876883015967906e-05,0.005541445221751928,,,,,
14
+ 26,1.0740000009536743,0.44734400510787964,1.9723698642337695e-05,0.0060032326728105545,,,,,
15
+ 28,1.1162999868392944,0.3701137900352478,1.9510565834934823e-05,0.006465019658207893,,,,,
16
+ 30,1.0454000234603882,0.43832680583000183,1.9238796085119247e-05,0.006926806643605232,,,,,
17
+ 32,1.124899983406067,0.4591037631034851,1.8910064682131633e-05,0.007388593629002571,,,,,
18
+ 34,1.0686999559402466,0.3873400390148163,1.8526401618146338e-05,0.00785038061439991,,,,,
19
+ 36,1.0291999578475952,0.40313437581062317,1.8090169760398567e-05,0.008312168531119823,,,,,
20
+ 38,1.1052000522613525,0.3735405504703522,1.7604059394216165e-05,0.008773955516517162,,,,,
21
+ 40,1.1555999517440796,0.3818407654762268,1.7071068214136176e-05,0.009235742501914501,,,,,
22
+ 42,1.0235999822616577,0.4255191683769226,1.6494481315021403e-05,0.00969752948731184,,,,,
23
+ 44,1.0364999771118164,0.4794503152370453,1.5877853002166376e-05,0.010159316472709179,,,,,
24
+ 46,1.1344000101089478,0.37273937463760376,1.5224985872919206e-05,0.010621103458106518,,,,,
25
+ 48,1.0866999626159668,0.417492538690567,1.453990535082994e-05,0.011082890443503857,,,,,
26
+ 50,1.1038000583648682,0.35408055782318115,1.3826834219798911e-05,0.01154467836022377,,,,,
27
+ 52,1.1478999853134155,0.3930828273296356,1.3090169886709191e-05,0.012006465345621109,,,,,
28
+ 54,1.1858999729156494,0.3965947926044464,1.2334453458606731e-05,0.012468252331018448,,,,,
29
+ 56,1.0096999406814575,0.3860221207141876,1.1564344276848715e-05,0.012930039316415787,,,,,
30
+ 58,1.114799976348877,0.44393691420555115,1.0784590813273098e-05,0.013391826301813126,,,,,
31
+ 60,1.079300045967102,0.3605058789253235,9.999999747378752e-06,0.013853613287210464,,,,,
32
+ 62,1.1766999959945679,0.40689122676849365,9.215408681484405e-06,0.014315400272607803,,,,,
33
+ 64,1.1075999736785889,0.4002344310283661,8.435655217908788e-06,0.014777187258005142,,,,,
34
+ 66,1.1866999864578247,0.46947163343429565,7.665546036150772e-06,0.015238975174725056,,,,,
35
+ 68,1.0311000347137451,0.3296957314014435,6.909830062795663e-06,0.01570076122879982,,,,,
36
+ 70,1.1088999509811401,0.33858785033226013,6.173165729705943e-06,0.01616254821419716,,,,,
37
+ 72,1.0720000267028809,0.3967427909374237,5.460095053422265e-06,0.016624337062239647,,,,,
38
+ 74,1.1460000276565552,0.41202062368392944,4.7750145313329995e-06,0.017086124047636986,,,,,
39
+ 76,1.0425000190734863,0.38334518671035767,4.1221474020858295e-06,0.017547911033034325,,,,,
40
+ 78,0.9154000282287598,0.40649303793907166,3.505519543978153e-06,0.018009698018431664,,,,,
41
+ 80,1.1110999584197998,0.35371580719947815,2.9289321901160292e-06,0.018471485003829002,,,,,
42
+ 82,1.1672999858856201,0.3381657302379608,2.3959403279150138e-06,0.01893327198922634,,,,,
43
+ 84,1.2374000549316406,0.3815234303474426,1.909829961732612e-06,0.01939505897462368,,,,,
44
+ 86,1.2151000499725342,0.38446080684661865,1.4735983313585166e-06,0.01985684596002102,,,,,
45
+ 88,1.163100004196167,0.40419140458106995,1.0899348126258701e-06,0.020318632945418358,,,,,
46
+ 90,1.1883000135421753,0.4011874198913574,7.612046601934708e-07,0.020780419930815697,,,,,
47
+ 92,1.1526999473571777,0.3836020231246948,4.894348535344761e-07,0.021242206916213036,,,,,
48
+ 94,1.15339994430542,0.452364057302475,2.7630079557638965e-07,0.021703993901610374,,,,,
49
+ 96,1.062000036239624,0.3502688705921173,1.2311659247643547e-07,0.022165780887007713,,,,,
50
+ 98,1.0271999835968018,0.4022065997123718,3.0826662111849146e-08,0.022627567872405052,,,,,
51
+ 100,1.0283000469207764,0.38241174817085266,0.0,0.02308935672044754,183.9481964111328,8.697999954223633,0.5440000295639038,1862467846144.0,1.1177252531051636
example/gpu_sft_loss.png ADDED
example/npu_pretrain.csv ADDED
@@ -0,0 +1,51 @@
1
+ step,train/loss,train/grad_norm,train/learning_rate,train/epoch,train/train_runtime,train/train_samples_per_second,train/train_steps_per_second,train/total_flos,train/train_loss
2
+ 2,2.7920000553131104,0.035306449979543686,7.999999979801942e-06,0.010457516647875309,,,,,
3
+ 4,2.8011999130249023,0.03491510450839996,1.5999999959603883e-05,0.020915033295750618,,,,,
4
+ 6,2.7964000701904297,0.032717395573854446,2.4000000848900527e-05,0.0313725508749485,,,,,
5
+ 8,2.763700008392334,0.024953875690698624,3.199999991920777e-05,0.041830066591501236,,,,,
6
+ 10,3.2811999320983887,0.3170815408229828,3.9999998989515007e-05,0.05228758230805397,,,,,
7
+ 12,2.9409000873565674,0.04423849284648895,3.995128281530924e-05,0.062745101749897,,,,,
8
+ 14,2.851900100708008,0.03667925298213959,3.9805359847377986e-05,0.07320261746644974,,,,,
9
+ 16,2.7869999408721924,0.022814607247710228,3.9562950405525044e-05,0.08366013318300247,,,,,
10
+ 18,2.782599925994873,0.021528413519263268,3.922523319488391e-05,0.0941176488995552,,,,,
11
+ 20,2.785599946975708,0.017014438286423683,3.87938525818754e-05,0.10457516461610794,,,,,
12
+ 22,2.7571001052856445,0.015719758346676826,3.827090768027119e-05,0.11503268033266068,,,,,
13
+ 24,2.762399911880493,0.016948623582720757,3.7658952351193875e-05,0.125490203499794,,,,,
14
+ 26,2.7411000728607178,0.015535997226834297,3.6960962461307645e-05,0.13594771921634674,,,,,
15
+ 28,2.7330000400543213,0.012748735956847668,3.6180339520797133e-05,0.14640523493289948,,,,,
16
+ 30,2.723299980163574,0.014809778891503811,3.532088885549456e-05,0.1568627506494522,,,,,
17
+ 32,2.7342000007629395,0.01219236571341753,3.4386797779006884e-05,0.16732026636600494,,,,,
18
+ 34,2.7321999073028564,0.012785322032868862,3.338261376484297e-05,0.17777778208255768,,,,,
19
+ 36,2.7314000129699707,0.012986919842660427,3.231322989449836e-05,0.1882352977991104,,,,,
20
+ 38,2.7065999507904053,0.01096824835985899,3.118385939160362e-05,0.19869281351566315,,,,,
21
+ 40,2.6958999633789062,0.012387535534799099,2.9999999242136255e-05,0.20915032923221588,,,,,
22
+ 42,2.751499891281128,0.011586200445890427,2.8767422918463126e-05,0.21960784494876862,,,,,
23
+ 44,2.713099956512451,0.011821281164884567,2.749213126662653e-05,0.23006536066532135,,,,,
24
+ 46,2.7102999687194824,0.01147585827857256,2.6180339773418382e-05,0.24052287638187408,,,,,
25
+ 48,2.7019999027252197,0.011368263512849808,2.483843854861334e-05,0.250980406999588,,,,,
26
+ 50,2.680500030517578,0.009935515932738781,2.3472963221138343e-05,0.26143792271614075,,,,,
27
+ 52,2.6993000507354736,0.0109846917912364,2.2090569473220967e-05,0.2718954384326935,,,,,
28
+ 54,2.6940999031066895,0.010465175844728947,2.0697989384643734e-05,0.2823529541492462,,,,,
29
+ 56,2.7091000080108643,0.01009758748114109,1.9302009604871273e-05,0.29281046986579895,,,,,
30
+ 58,2.69950008392334,0.01249368954449892,1.7909431335283443e-05,0.3032679855823517,,,,,
31
+ 60,2.7216999530792236,0.01051376760005951,1.6527035768376663e-05,0.3137255012989044,,,,,
32
+ 62,2.7158000469207764,0.01054943073540926,1.516156225989107e-05,0.32418301701545715,,,,,
33
+ 64,2.7214999198913574,0.01076149195432663,1.3819660125591327e-05,0.3346405327320099,,,,,
34
+ 66,2.7116000652313232,0.010380392894148827,1.2507867722888477e-05,0.3450980484485626,,,,,
35
+ 68,2.6923000812530518,0.010425001382827759,1.1232576980546582e-05,0.35555556416511536,,,,,
36
+ 70,2.683199882507324,0.00925016961991787,9.999999747378752e-06,0.3660130798816681,,,,,
37
+ 72,2.7093000411987305,0.01072422880679369,8.816142326395493e-06,0.3764705955982208,,,,,
38
+ 74,2.6988000869750977,0.011063243262469769,7.686770914006047e-06,0.38692811131477356,,,,,
39
+ 76,2.7125000953674316,0.01013101264834404,6.61738795315614e-06,0.3973856270313263,,,,,
40
+ 78,2.6993000507354736,0.009940676391124725,5.613203938992228e-06,0.40784314274787903,,,,,
41
+ 80,2.6861000061035156,0.01050259917974472,4.6791110435151495e-06,0.41830065846443176,,,,,
42
+ 82,2.6828999519348145,0.009912634268403053,3.819659923465224e-06,0.4287581741809845,,,,,
43
+ 84,2.699199914932251,0.010668900795280933,3.03903811982309e-06,0.43921568989753723,,,,,
44
+ 86,2.698899984359741,0.009650414809584618,2.3410482299368596e-06,0.44967320561408997,,,,,
45
+ 88,2.6884000301361084,0.01006452739238739,1.7290908544964623e-06,0.4601307213306427,,,,,
46
+ 90,2.6928999423980713,0.010409764014184475,1.206147544507985e-06,0.47058823704719543,,,,,
47
+ 92,2.714200019836426,0.009937116876244545,7.74766078848188e-07,0.48104575276374817,,,,,
48
+ 94,2.672300100326538,0.009728306904435158,4.370479871340649e-07,0.4915032684803009,,,,,
49
+ 96,2.7018001079559326,0.010098566301167011,1.9463863054625108e-07,0.501960813999176,,,,,
50
+ 98,2.7123000621795654,0.009524320252239704,4.8718995060426096e-08,0.5124183297157288,,,,,
51
+ 100,2.7028000354766846,0.009290286339819431,0.0,0.5228758454322815,788.0635986328125,64.96900177001953,0.12700000405311584,4.629706395531346e+17,2.739542245864868
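Review note: the GPU and NPU pretraining logs can be compared row by row to check numerical parity. The sketch below does this for three steps copied from `gpu_pretrain.csv` and `npu_pretrain.csv`, and it also compares the two reported `train/train_runtime` values. The function name `max_abs_diff` is hypothetical. The hardware models are not stated in this commit, so the runtime ratio only describes these two runs.

```python
# Loss values at steps 10, 50, and 100, copied from the two CSV files above.
GPU_LOSS = {10: 3.281599998474121, 50: 2.680500030517578, 100: 2.7028000354766846}
NPU_LOSS = {10: 3.2811999320983887, 50: 2.680500030517578, 100: 2.7028000354766846}

# train/train_runtime from the final row of each file, in seconds.
GPU_RUNTIME_S = 365.8839111328125
NPU_RUNTIME_S = 788.0635986328125


def max_abs_diff(a, b):
    """Largest absolute difference over the steps present in both logs."""
    shared = sorted(a.keys() & b.keys())
    if not shared:
        raise ValueError("the two logs share no steps")
    return max(abs(a[step] - b[step]) for step in shared)


gap = max_abs_diff(GPU_LOSS, NPU_LOSS)
ratio = NPU_RUNTIME_S / GPU_RUNTIME_S
print(f"max |loss gap| = {gap:.4f}, NPU/GPU runtime = {ratio:.2f}x")
```

For these sampled rows the loss gap stays below 1e-3, while the NPU run took roughly twice as long as the GPU run.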
example/npu_pretrain_loss.png ADDED
example/npu_sft.csv ADDED
@@ -0,0 +1,51 @@
1
+ step,train/loss,train/grad_norm,train/learning_rate,train/epoch,train/train_runtime,train/train_samples_per_second,train/train_steps_per_second,train/total_flos,train/train_loss
2
+ 2,1.1491999626159668,0.6218180060386658,1.9999999949504854e-06,0.0004617871018126607,,,,,
3
+ 4,1.0981999635696411,0.6825665235519409,3.999999989900971e-06,0.0009235742036253214,,,,,
4
+ 6,1.1269999742507935,0.7838642001152039,6.000000212225132e-06,0.001385361305437982,,,,,
5
+ 8,1.0542000532150269,0.8744276762008667,7.999999979801942e-06,0.0018471484072506428,,,,,
6
+ 10,1.2441999912261963,0.7064258456230164,9.999999747378752e-06,0.0023089356254786253,,,,,
7
+ 12,1.2927000522613525,0.6829814910888672,1.2000000424450263e-05,0.002770722610875964,,,,,
8
+ 14,1.0844999551773071,0.5265647172927856,1.4000000192027073e-05,0.0032325098291039467,,,,,
9
+ 16,1.0963000059127808,0.4373657703399658,1.5999999959603883e-05,0.0036942968145012856,,,,,
10
+ 18,1.0615999698638916,0.46220508217811584,1.8000000636675395e-05,0.004156084265559912,,,,,
11
+ 20,1.3325999975204468,0.7157824039459229,1.9999999494757503e-05,0.004617871250957251,,,,,
12
+ 22,1.2070000171661377,0.5933427214622498,1.996917308133561e-05,0.0050796582363545895,,,,,
13
+ 24,1.2044999599456787,0.5816172957420349,1.9876883015967906e-05,0.005541445221751928,,,,,
14
+ 26,1.0740000009536743,0.4489712119102478,1.9723698642337695e-05,0.0060032326728105545,,,,,
15
+ 28,1.1164000034332275,0.3696516752243042,1.9510565834934823e-05,0.006465019658207893,,,,,
16
+ 30,1.045199990272522,0.4376335144042969,1.9238796085119247e-05,0.006926806643605232,,,,,
17
+ 32,1.1247999668121338,0.4589230716228485,1.8910064682131633e-05,0.007388593629002571,,,,,
18
+ 34,1.0688999891281128,0.3879022002220154,1.8526401618146338e-05,0.00785038061439991,,,,,
19
+ 36,1.0292999744415283,0.4027869403362274,1.8090169760398567e-05,0.008312168531119823,,,,,
20
+ 38,1.1052000522613525,0.37394437193870544,1.7604059394216165e-05,0.008773955516517162,,,,,
21
+ 40,1.1557999849319458,0.3808683753013611,1.7071068214136176e-05,0.009235742501914501,,,,,
22
+ 42,1.0232000350952148,0.4252733886241913,1.6494481315021403e-05,0.00969752948731184,,,,,
23
+ 44,1.0364999771118164,0.48068660497665405,1.5877853002166376e-05,0.010159316472709179,,,,,
24
+ 46,1.1340999603271484,0.37313926219940186,1.5224985872919206e-05,0.010621103458106518,,,,,
25
+ 48,1.0866999626159668,0.4175492823123932,1.453990535082994e-05,0.011082890443503857,,,,,
26
+ 50,1.1039999723434448,0.35443660616874695,1.3826834219798911e-05,0.01154467836022377,,,,,
27
+ 52,1.1480000019073486,0.39232146739959717,1.3090169886709191e-05,0.012006465345621109,,,,,
28
+ 54,1.1861000061035156,0.396918922662735,1.2334453458606731e-05,0.012468252331018448,,,,,
29
+ 56,1.0096999406814575,0.3885609209537506,1.1564344276848715e-05,0.012930039316415787,,,,,
30
+ 58,1.114799976348877,0.4421806335449219,1.0784590813273098e-05,0.013391826301813126,,,,,
31
+ 60,1.0795999765396118,0.36081990599632263,9.999999747378752e-06,0.013853613287210464,,,,,
32
+ 62,1.1764999628067017,0.4062329828739166,9.215408681484405e-06,0.014315400272607803,,,,,
33
+ 64,1.107200026512146,0.39982733130455017,8.435655217908788e-06,0.014777187258005142,,,,,
34
+ 66,1.1868000030517578,0.4688170254230499,7.665546036150772e-06,0.015238975174725056,,,,,
35
+ 68,1.0312999486923218,0.3301626741886139,6.909830062795663e-06,0.01570076122879982,,,,,
36
+ 70,1.1089999675750732,0.3377252221107483,6.173165729705943e-06,0.01616254821419716,,,,,
37
+ 72,1.0716999769210815,0.39666977524757385,5.460095053422265e-06,0.016624337062239647,,,,,
38
+ 74,1.1461999416351318,0.4125552177429199,4.7750145313329995e-06,0.017086124047636986,,,,,
39
+ 76,1.042199969291687,0.3825180232524872,4.1221474020858295e-06,0.017547911033034325,,,,,
40
+ 78,0.9157000184059143,0.4063441753387451,3.505519543978153e-06,0.018009698018431664,,,,,
41
+ 80,1.1110999584197998,0.35289037227630615,2.9289321901160292e-06,0.018471485003829002,,,,,
42
+ 82,1.167199969291687,0.33720290660858154,2.3959403279150138e-06,0.01893327198922634,,,,,
43
+ 84,1.2375999689102173,0.38099613785743713,1.909829961732612e-06,0.01939505897462368,,,,,
44
+ 86,1.2151999473571777,0.3848689794540405,1.4735983313585166e-06,0.01985684596002102,,,,,
45
+ 88,1.1628999710083008,0.40408074855804443,1.0899348126258701e-06,0.020318632945418358,,,,,
46
+ 90,1.1884000301361084,0.4015007019042969,7.612046601934708e-07,0.020780419930815697,,,,,
47
+ 92,1.152500033378601,0.38306349515914917,4.894348535344761e-07,0.021242206916213036,,,,,
48
+ 94,1.154099941253662,0.45273807644844055,2.7630079557638965e-07,0.021703993901610374,,,,,
49
+ 96,1.0618000030517578,0.35036078095436096,1.2311659247643547e-07,0.022165780887007713,,,,,
50
+ 98,1.0270999670028687,0.40208569169044495,3.0826662111849146e-08,0.022627567872405052,,,,,
51
+ 100,1.0285999774932861,0.38247284293174744,0.0,0.02308935672044754,728.7083129882812,2.196000099182129,0.13699999451637268,1862467846144.0,1.117748498916626
example/npu_sft_loss.png ADDED
example/run.sh CHANGED
@@ -1,6 +1,6 @@
  #!/bin/bash
 
- MODEL_PATH="/model/BitCPM/BitCPM4-CANN-1B-unquantized/"
+ MODEL_PATH="/model/BitCPM4-CANN-1B-unquantized"
  DATA_PATH="/dataset/c4-pro/data/000_1_7.parquet"
  OUTPUT_DIR="./output"
  DS_CONFIG="./ds_config_z2.json"
@@ -11,7 +11,8 @@ GRAD_ACCUM_STEPS=8
  MAX_SEQ_LENGTH=1024
 
  export ASCEND_RT_VISIBLE_DEVICES=8,9,10,11,12,13,14,15
-
+ export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+ export DS_SKIP_CUDA_CHECK=1
  torchrun --nproc_per_node=$NUM_GPUS train.py \
  --model_name_or_path $MODEL_PATH \
  --data_path $DATA_PATH \
@@ -19,7 +20,7 @@ torchrun --nproc_per_node=$NUM_GPUS train.py \
  --output_dir $OUTPUT_DIR \
  --per_device_train_batch_size $BATCH_SIZE_PER_GPU \
  --gradient_accumulation_steps $GRAD_ACCUM_STEPS \
- --max_steps 500 \
+ --max_steps 100 \
  --learning_rate 4e-5 \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.1 \
@@ -33,5 +34,5 @@ torchrun --nproc_per_node=$NUM_GPUS train.py \
  --seed 42 \
  --dataloader_num_workers 4 \
  --report_to tensorboard \
- --logging_dir /data/tensorboard/ \
+ --logging_dir /data/tensorboard/pretrain \
  --gradient_checkpointing_kwargs '{"use_reentrant": false}'
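Review note: with `--max_steps 100`, `--learning_rate 4e-5`, `--lr_scheduler_type cosine`, and `--warmup_ratio 0.1`, the learning rates logged in `gpu_pretrain.csv` are consistent with a linear warmup over the first 10 steps followed by a half-cosine decay to zero. The sketch below reproduces that shape. The function `lr_at` is a hypothetical helper written for this note, not the scheduler code the Trainer runs, and it takes the warmup length in steps (10 here) rather than as a ratio.

```python
import math


def lr_at(step, max_steps=100, peak_lr=4e-5, warmup_steps=10):
    """Linear warmup to peak_lr, then half-cosine decay to zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Compare with the train/learning_rate column of gpu_pretrain.csv.
for step in (2, 10, 12, 40, 70, 100):
    print(step, f"{lr_at(step):.6e}")
```

The same function with `peak_lr=2e-5` and `warmup_steps=20` matches the learning rates logged for the SFT run, which uses `--warmup_ratio 0.2`.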
example/run_sft.sh CHANGED
@@ -1,16 +1,18 @@
  #!/bin/bash
 
- MODEL_PATH="/model/BitCPM/BitCPM4-CANN-3B-unquantized/"
- DATA_PATH=""
+ MODEL_PATH="/model/BitCPM4-CANN-1B-unquantized"
+ DATA_PATH="/dataset/HuggingFaceH4_ultrachat_200k/data/train_sft-00000-of-00003-a3ecf92756993583.parquet"
  OUTPUT_DIR="./output_sft"
  DS_CONFIG="./ds_config.json"
 
  NUM_GPUS=8
  BATCH_SIZE_PER_GPU=2
  GRAD_ACCUM_STEPS=1
- MAX_SEQ_LENGTH=4096
+ MAX_SEQ_LENGTH=8192
 
  export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+ export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+ export DS_SKIP_CUDA_CHECK=1
 
  torchrun --nproc_per_node=$NUM_GPUS train_sft.py \
  --model_name_or_path $MODEL_PATH \
@@ -19,10 +21,10 @@ torchrun --nproc_per_node=$NUM_GPUS train_sft.py \
  --output_dir $OUTPUT_DIR \
  --per_device_train_batch_size $BATCH_SIZE_PER_GPU \
  --gradient_accumulation_steps $GRAD_ACCUM_STEPS \
- --num_train_epochs 3 \
+ --max_steps 100 \
  --learning_rate 2e-5 \
  --lr_scheduler_type cosine \
- --warmup_ratio 0.03 \
+ --warmup_ratio 0.2 \
  --weight_decay 0.0 \
  --logging_steps 2 \
  --save_steps 500 \
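Review note: the SFT settings above (`NUM_GPUS=8`, `BATCH_SIZE_PER_GPU=2`, `GRAD_ACCUM_STEPS=1`, `MAX_SEQ_LENGTH=8192`) imply a global batch of 16 sequences per optimizer step. The sketch below works through that arithmetic and cross-checks it against the throughput columns in the last row of `gpu_sft.csv`. The function name `effective_batch` is hypothetical. The token count is an upper bound that assumes every sequence fills the maximum length, which this commit does not confirm.

```python
def effective_batch(num_devices, per_device_batch, grad_accum_steps):
    """Sequences consumed per optimizer step across all devices."""
    return num_devices * per_device_batch * grad_accum_steps


# Values taken from run_sft.sh above.
sft_batch = effective_batch(num_devices=8, per_device_batch=2, grad_accum_steps=1)

# Upper bound only: assumes each sequence is exactly MAX_SEQ_LENGTH tokens long.
max_tokens_per_step = sft_batch * 8192

# Cross-check with the final row of gpu_sft.csv: samples/s divided by steps/s.
samples_per_step = 8.697999954223633 / 0.5440000295639038

print(sft_batch, max_tokens_per_step, round(samples_per_step, 2))
```

The logged ratio rounds to the same 16 samples per step, so the batch configuration and the GPU log agree.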