PCL-Reasoner committed
Commit c0c37be · verified · 1 Parent(s): f27df66

Update README.md

Files changed (1): README.md +0 -257

README.md CHANGED
@@ -12,267 +12,10 @@

https://github.com/PCL-Reasoner/V1

https://openi.pcl.ac.cn/PCL-Reasoner/V1

## Development Guide

### 1. Model Files
PCL-Reasoner-V1 is fine-tuned from Qwen2.5-32B-Base using MindFormers. Key files include:

**Data Processing:**
```
pcl_reasoner_v1
├── qwen2_5_tokenizer.py                 # Qwen2.5 tokenizer
├── packing_handler.py                   # Data packing handler
└── data_preprocess
    ├── decontaminate.py                 # Validation-set contamination detection
    └── dataset_prehandle_and_split.py   # Dataset preprocessing and splitting
```

**Model Configuration:**
```
pcl_reasoner_v1/config
├── data_process_handling.yaml          # Format-conversion configuration
├── data_process_packing.yaml           # Data-packing configuration
└── finetune_pcl_reasoner_v1_32k.yaml   # Fine-tuning configuration
```

**Task Launch Script:**
```
pcl_reasoner_v1
└── run_pcl_reasoner_v1_finetune.sh   # Fine-tuning launch script
```


### 2. Environment & Data Setup
#### 2.1 Environment Installation

| Software | Version |
|----------|---------|
| Firmware & Driver | 24.1.rc3.5 |
| CANN | 7.7.T9.0.B057:8.1.RC1 |
| Python | 3.10 |
| MindSpore | 2.6.0 |
| MindSpore Transformers | r1.5.0 |

#### 2.2 Data Processing

##### 2.2.1 Dataset Download

Users can download the original dataset from HuggingFace:

| Dataset Name | Dataset Link |
| ------------ | ------------ |
| AM-DeepSeek-R1-0528-Distilled | [https://huggingface.co/a-m-team/AM-DeepSeek-R1-0528-Distilled](https://huggingface.co/a-m-team/AM-DeepSeek-R1-0528-Distilled) |

##### 2.2.2 Data Preprocessing

First, we screen the source data in two steps: **validation-set contamination detection** and **data filtering**.

* **Validation-Set Contamination Detection**: We use the **all-MiniLM-L6-v2** model to compute text cosine similarity and detect contamination of the original mathematical data against the AIME24/25 evaluation sets. The script prints detection results to the terminal and saves questions whose similarity exceeds the threshold (along with the matched evaluation questions) to the specified output path.

```
python PCL-Reasoner-V1/pcl_reasoner_v1/data_preprocess/decontaminate.py \
    --target_data /path/to/target_data \
    --contaminant_source PCL-Reasoner-V1/pcl_reasoner_v1/data_preprocess/aime2425_questions.json \
    --model_path /path/to/distilled/model_path \
    --output_file_prefix /path/to/output_file_prefix \
    --threshold 0.7

# Parameter Description
target_data: Data to be checked
contaminant_source: Contamination source (evaluation-set data)
model_path: Model used for text-embedding computation
output_file_prefix: Output path prefix for results
threshold: Similarity threshold
```
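The thresholding step can be sketched in pure NumPy, assuming the `all-MiniLM-L6-v2` embeddings have already been computed. Here `flag_contaminated` is a hypothetical helper (not part of `decontaminate.py`), and the toy 2-D vectors stand in for real sentence embeddings:

```python
import numpy as np

def flag_contaminated(target_emb, contaminant_emb, threshold=0.7):
    """Return (target_idx, contaminant_idx) pairs whose cosine
    similarity exceeds the threshold."""
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    c = contaminant_emb / np.linalg.norm(contaminant_emb, axis=1, keepdims=True)
    sims = t @ c.T  # pairwise cosine-similarity matrix
    return [(int(i), int(j)) for i, j in np.argwhere(sims > threshold)]

# Toy 2-D vectors standing in for sentence embeddings
targets = np.array([[1.0, 0.0], [0.0, 1.0]])
evalset = np.array([[0.9, 0.1]])  # nearly parallel to targets[0]
print(flag_contaminated(targets, evalset))  # [(0, 0)]
```

Any flagged target question would then be dropped from the training set along with the evaluation question it matched.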
* **Data Filtering & Processing**: Run the data-processing script to filter samples by length, keeping those whose combined question and reasoning-chain length is under **32K tokens**, and to add prompts to the data.

```
python PCL-Reasoner-V1/pcl_reasoner_v1/data_preprocess/convert_and_split_dataset.py \
    --json_file_paths /path/to/AM-DeepSeek-R1-0528-Distilled/math.jsonl

# Parameter Description
json_file_paths: Datasets to process (multiple paths separated by spaces)
```
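A minimal sketch of the length filter, with a whitespace tokenizer standing in for the real Qwen2.5 tokenizer and `filter_by_length` as a hypothetical helper (the actual script also adds prompts and splits the dataset):

```python
MAX_TOKENS = 32 * 1024  # 32K-token budget for question + reasoning chain

def filter_by_length(records, count_tokens=lambda s: len(s.split())):
    """Keep records whose combined question + answer token count is
    under MAX_TOKENS. The default counter is a whitespace stand-in
    for the Qwen2.5 tokenizer used by the real pipeline."""
    return [
        r for r in records
        if count_tokens(r["question"]) + count_tokens(r["answer"]) < MAX_TOKENS
    ]

data = [
    {"question": "short q", "answer": "short a"},
    {"question": "q", "answer": "word " * 40000},  # far over the 32K budget
]
print(len(filter_by_length(data)))  # 1
```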

Then we convert the data into **packed format** in two sequential steps: format conversion and data packing.

* **Format Conversion**: In `pcl_reasoner_v1/config/data_process_handling.yaml`, set the paths for `data_files`, `vocab_file`, and `merges_file`, and register the custom `AMDeepSeekDataHandler` from `pcl_reasoner_v1/packing_handler.py` as the data handler:

```
train_dataset:
  ...
  path: "json"                          # Original dataset format
  data_files: ["/path/to/data.jsonl"]   # Path to raw dataset
  input_columns: *input_columns
  handler:
    - type: AMDeepSeekDataHandler       # Custom data handler class
      ...
      tokenizer:
        auto_register: qwen2_5_tokenizer.Qwen2Tokenizer
        ...
        vocab_file: "/path/to/vocab.json"  # Qwen2.5 tokenizer vocabulary
        merges_file: "/path/to/merges.txt" # Qwen2.5 merge rules
  ...
```

*(Note: This is a minimal example showing frequently modified fields. The full configuration is available in the code repository.)*

Run the conversion script to generate **Arrow-format data**:

```
export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH
python /path/to/mindformers/toolkit/data_preprocess/huggingface/datasets_preprocess.py \
    --config ./pcl_reasoner_v1/config/data_process_handling.yaml \
    --save_path /path/to/handled_data/ \
    --register_path ./pcl_reasoner_v1/

# Parameter Description
config: Path to the format-conversion config file
save_path: Output directory for processed data
register_path: Registration path for the custom handler
```
* **Data Packing**:

Configure `pcl_reasoner_v1/config/data_process_packing.yaml` to point at the converted data:

```
# dataset
train_dataset:
  data_loader:
    ...
    path: /path/to/handled_data  # Processed dataset
    ...
```

*(Note: Example shows key fields only. Refer to the repository for the full config.)*

Run the packing script to generate **sequence-packed data**:

```
export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH
python /path/to/mindformers/toolkit/data_preprocess/huggingface/datasets_preprocess.py \
    --config ./pcl_reasoner_v1/config/data_process_packing.yaml \
    --save_path /path/to/packed_data/ \
    --register_path ./pcl_reasoner_v1/

# Parameter Description
config: Path to the data-packing config file
save_path: Output directory for packed data
register_path: Registration path for the custom handler
```
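Conceptually, sequence packing concatenates short samples into fixed-length bins so every training sequence fills the 32K context. A first-fit sketch (illustrative only; the real packing is performed by `AMDeepSeekDataHandler` through MindFormers and also builds the compressed attention masks):

```python
def pack_sequences(lengths, seq_length=32768):
    """Greedily pack sample lengths into bins holding at most
    seq_length tokens; returns one list of sample indices per bin."""
    bins, loads = [], []
    for idx, n in enumerate(lengths):
        for b, load in enumerate(loads):
            if load + n <= seq_length:  # first bin with room wins
                bins[b].append(idx)
                loads[b] += n
                break
        else:  # no existing bin fits: open a new one
            bins.append([idx])
            loads.append(n)
    return bins

print(pack_sequences([20000, 10000, 15000, 3000]))  # [[0, 1], [2, 3]]
```

The earlier <32K length filter guarantees every sample fits in a bin on its own.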


### 3. Training Process
#### 3.1 Weight Preparation

Users can download pre-trained weights from HuggingFace:

| Model Name | Weights URL |
| ---------- | ----------- |
| Qwen2.5-32B-Base | [https://huggingface.co/Qwen/Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B) |

**Note**: MindFormers v1.5.0+ supports direct loading/saving of `safetensors`-format weights; no conversion to `ckpt` is required. Subsequent fine-tuning uses the `safetensors` format.

#### 3.2 Training Configuration
*(Only frequently modified configurations are shown. Full config: `pcl_reasoner_v1/config/finetune_pcl_reasoner_v1_32k.yaml`)*

**Basic Configuration:**
```yaml
run_mode: 'finetune'                        # Training mode: fine-tuning
load_checkpoint: '/path/to/Qwen-32B-base/'  # Weight file path
load_ckpt_format: 'safetensors'             # Weight format
auto_trans_ckpt: True                       # Enable online weight splitting for distributed training
```
**Dataset Configuration:**
```yaml
train_dataset: &train_dataset
  data_loader:
    type: CommonDataLoader
    path: "/path/to/dataset/pack_data_lt_32K_full"  # Packed dataset path
    load_func: 'load_from_disk'                     # Data loading method
    shuffle: True                                   # Enable data shuffling
    packing: pack                                   # Packed data format
    adaptor_config:
      compress_mask: True
    mock_config:
      seq_length: 32768                             # Packed sequence length (32K tokens)
      size: 25909                                   # Dataset size / data-parallel split
```
**Parallelism Configuration:**
```yaml
parallel_config:
  data_parallel: &dp 8    # Data parallelism
  model_parallel: 8       # Model (tensor) parallelism
  pipeline_stage: 2       # Pipeline parallelism stages
  use_seq_parallel: True  # Enable sequence parallelism
  optimizer_shard: True   # Enable optimizer sharding
  micro_batch_num: 16     # Number of micro-batches per pipeline step
```
> *(Note: This example lists only frequently modified items. Refer to the code repository for the complete configuration.)*
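As a sanity check, the world size implied by this layout is `data_parallel × model_parallel × pipeline_stage`, which should match the total card count passed to the launcher (128 in the launch script below); `world_size` is a hypothetical helper for the arithmetic:

```python
def world_size(data_parallel, model_parallel, pipeline_stage):
    """Total accelerator cards implied by the 3D parallel layout."""
    return data_parallel * model_parallel * pipeline_stage

cards = world_size(data_parallel=8, model_parallel=8, pipeline_stage=2)
print(cards)  # 128
assert cards % 8 == 0  # must divide evenly across 8-card servers
```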

#### 3.3 Launching Fine-tuning

In the launch script `run_pcl_reasoner_v1_finetune.sh`, specify the configuration file `pcl_reasoner_v1/config/finetune_pcl_reasoner_v1_32k.yaml` and adjust the cluster parameters to your hardware environment:

```bash
noderank=$1

bash /path/to/mindformers/scripts/msrun_launcher.sh "run_mindformer.py \
    --config /path/to/finetune_pcl_reasoner_v1_32k.yaml \
    --run_mode finetune" \
    --worker_num 128 \
    --local_worker_num 8 \
    --master_addr XX.XX.XX.XX \
    --master_port XXXX \
    --node_rank $noderank \
    --log_dir /path/to/log \
    --join False \
    --cluster_time_out 1200 \
    > run.log 2>&1

# Parameter Description
# config: Path to the configuration file
# run_mode: Operation mode (pretrain/finetune/inference)
# worker_num: Total number of accelerator cards
# local_worker_num: Cards per server
# master_addr: Master node address
# master_port: Master node port
# log_dir: Log directory path
# join: Whether to wait for all workers to exit
# cluster_time_out: Cluster timeout duration
```
Then launch the fine-tuning task:
```
bash run_pcl_reasoner_v1_finetune.sh 0
```
> *(Note: When launching on multiple nodes, pass each node's node_rank, e.g., 0 for the first node.)*

After starting the task, monitor the runtime logs with:
```
tail -f /path/to/log/worker_127.log
```

### 4. Evaluation

To ensure fair evaluation results, we adopted the **open-source evaluation code from QwQ** ([QwQ/eval at main · QwenLM/QwQ](https://github.com/QwenLM/QwQ)). Developers can follow the `README.md` in that repository to set up the environment and evaluate models.

#### Evaluation Hyperparameters
The sampling hyperparameters used are listed below:

| Hyperparameter | Value |
|----------------|-------|
| `temperature` | 0.6 |
| `top_k` | 40 |
| `top_p` | 0.95 |
| `max_tokens` | 129,024 |
| `chat_template` | `./pcl_reasoner_v1/eval/am_thinking.jinja` |
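The Avg@32 accuracy metric averages correctness over 32 sampling attempts per query, then across queries. A sketch with `avg_at_k` as a hypothetical helper (k=4 in the toy data for brevity; the evaluation uses k=32):

```python
def avg_at_k(correct_flags_per_query, k=32):
    """Avg@k: mean accuracy over k sampled attempts per query,
    averaged across all queries. Each entry is a list of k booleans,
    one per sampling attempt."""
    per_query = [sum(flags) / k for flags in correct_flags_per_query]
    return sum(per_query) / len(per_query)

# Two toy queries, 4 attempts each
flags = [[True, True, False, True], [False, False, True, False]]
print(avg_at_k(flags, k=4))  # 0.5
```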

#### Evaluation Results on AIME24/25
The table below compares mainstream models on the AIME24 and AIME25 benchmarks. For accuracy, we used the **Avg@32 metric** (averaging 32 sampling attempts per query):

<table>
<tr>
<th>Parameter Size</th>
 