https://github.com/PCL-Reasoner/V1

https://openi.pcl.ac.cn/PCL-Reasoner/V1
## Development Guide

### 1. Model Files

PCL-Reasoner-V1 is fine-tuned from Qwen2.5-32B-Base using MindFormers. Key files include:

**Data Processing:**

```
pcl_reasoner_v1
├── qwen2_5_tokenizer.py                  # Qwen2.5 tokenizer
├── packing_handler.py                    # Data packing handler
└── data_preprocess
    ├── decontaminate.py                  # Validation set contamination detection
    └── dataset_prehandle_and_split.py    # Dataset preprocessing and splitting
```

**Model Configuration:**

```
pcl_reasoner_v1/config
├── data_process_handling.yaml          # Format conversion configuration
├── data_process_packing.yaml           # Data packing configuration
└── finetune_pcl_reasoner_v1_32k.yaml   # Model fine-tuning configuration
```

**Task Launch Script:**

```
pcl_reasoner_v1
└── run_pcl_reasoner_v1_finetune.sh     # Model fine-tuning launch script
```

### 2. Environment & Data Setup

#### 2.1 Environment Installation

| Software | Version |
|----------|---------|
| Firmware & Driver | 24.1.rc3.5 |
| CANN | 7.7.T9.0.B057:8.1.RC1 |
| Python | 3.10 |
| MindSpore | 2.6.0 |
| MindSpore Transformers | r1.5.0 |

#### 2.2 Data Processing

##### 2.2.1 Dataset Download

Users can download the original dataset from HuggingFace:

| Dataset Name | Dataset Link |
|--------------|--------------|
| AM-DeepSeek-R1-0528-Distilled | [https://huggingface.co/a-m-team/AM-DeepSeek-R1-0528-Distilled](https://huggingface.co/a-m-team/AM-DeepSeek-R1-0528-Distilled) |

##### 2.2.2 Data Preprocessing

First, we detect and filter the source data in two steps: validation set contamination detection and data filtering.

* Validation Set Contamination Detection: We use the all-MiniLM-L6-v2 model to compute text cosine similarity and detect contamination of the original mathematical data against the AIME24/25 evaluation set. After execution, the script prints detection results in the terminal and saves questions whose similarity exceeds the threshold (along with the matched evaluation questions) to the specified output path.

```bash
python PCL-Reasoner-V1/pcl_reasoner_v1/data_preprocess/decontaminate.py \
    --target_data /path/to/target_data \
    --contaminant_source PCL-Reasoner-V1/pcl_reasoner_v1/data_preprocess/aime2425_questions.json \
    --model_path /path/to/distilled/model_path \
    --output_file_prefix /path/to/output_file_prefix \
    --threshold 0.7

# Parameter Description
# target_data: Data to be checked
# contaminant_source: Contamination source (evaluation set data)
# model_path: Model used for text embedding computation
# output_file_prefix: Output path prefix for results
# threshold: Similarity threshold
```
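The thresholding logic behind the decontamination step can be illustrated without the embedding model. The sketch below is a minimal, self-contained stand-in: it substitutes a toy bag-of-words "embedding" for the all-MiniLM-L6-v2 sentence embeddings the real script uses, and the record field names are illustrative only.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy bag-of-words stand-in; the real pipeline embeds with
    # all-MiniLM-L6-v2 instead of word counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_contaminated(targets, eval_questions, threshold=0.7):
    # Flag any training question whose similarity to some evaluation
    # question reaches the threshold.
    flagged = []
    for q in targets:
        for ref in eval_questions:
            sim = cosine(embed(q), embed(ref))
            if sim >= threshold:
                flagged.append({"question": q, "match": ref, "similarity": sim})
                break
    return flagged

hits = find_contaminated(
    ["find the remainder when 2024^2024 is divided by 7",
     "a completely unrelated geometry question about circles"],
    ["find the remainder when 2024^2024 is divided by 7"],
)
print(len(hits))  # 1: only the near-duplicate question is flagged
```

With real sentence embeddings the same loop applies; only `embed` and `cosine` are swapped for model encodings and vector dot products.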

* Data Filtering & Processing: Execute the data processing script to filter data by length, keeping samples where the combined length of the question and reasoning chain is under 32K tokens, and to add prompts to the data.

```bash
python PCL-Reasoner-V1/pcl_reasoner_v1/data_preprocess/convert_and_split_dataset.py \
    --json_file_paths /path/to/AM-DeepSeek-R1-0528-Distilled/math.jsonl

# Parameter Description
# json_file_paths: Datasets to process (multiple paths separated by spaces)
```
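The length filter above amounts to a simple budget check per sample. A minimal sketch, assuming hypothetical `question`/`reasoning` record fields and using whitespace word counts as a crude stand-in for the model tokenizer's token counts:

```python
MAX_LEN = 32768  # 32K-token budget for question + reasoning chain

def approx_tokens(text: str) -> int:
    # Crude stand-in: the real script would count tokens with the
    # Qwen2.5 tokenizer; whitespace words only illustrate the idea.
    return len(text.split())

def filter_by_length(records, max_len=MAX_LEN):
    kept = []
    for rec in records:
        total = approx_tokens(rec["question"]) + approx_tokens(rec["reasoning"])
        if total < max_len:
            kept.append(rec)
    return kept

records = [
    {"question": "short q", "reasoning": "short chain"},
    {"question": "q", "reasoning": "word " * 40000},  # far over budget
]
print(len(filter_by_length(records)))  # 1: only the short sample survives
```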

Then, we convert the data into packed format through two sequential steps: format conversion and data packing.

* Format Conversion: Specify paths such as `data_files`, `vocab_file`, and `merges_file` in `pcl_reasoner_v1/config/data_process_handling.yaml`, and register the custom `AMDeepSeekDataHandler` from `pcl_reasoner_v1/packing_handler.py` as the data handler:

```yaml
train_dataset:
  ...
  path: "json"                            # Original dataset format
  data_files:
    ["/path/to/data.jsonl"]               # Path to raw dataset
  input_columns: *input_columns
  handler:
    - type: AMDeepSeekDataHandler         # Custom data handler class
      ...
  tokenizer:
    auto_register: qwen2_5_tokenizer.Qwen2Tokenizer
    ...
    vocab_file: "/path/to/vocab.json"     # Qwen2.5 tokenizer vocabulary
    merges_file: "/path/to/merges.txt"    # Qwen2.5 merge rules
  ...
```

*(Note: This is a minimal example showing frequently modified fields. The full configuration is available in the code repository.)*

Execute the conversion script to generate Arrow-format data:

```bash
export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH
python /path/to/mindformers/toolkit/data_preprocess/huggingface/datasets_preprocess.py \
    --config ./pcl_reasoner_v1/config/data_process_handling.yaml \
    --save_path /path/to/handled_data/ \
    --register_path ./pcl_reasoner_v1/

# Parameter Description
# config: Path to the format conversion config file
# save_path: Output directory for processed data
# register_path: Registration path for the custom handler
```

* Data Packing: Configure `pcl_reasoner_v1/config/data_process_packing.yaml` to specify the input path for packed data generation:

```yaml
# dataset
train_dataset:
  data_loader:
    ...
    path: /path/to/handled_data   # Processed dataset
    ...
```

*(Note: Example shows key fields only. Refer to the repository for the full config.)*

Run the packing script to generate sequence-packed data:

```bash
export PYTHONPATH=/path/to/mindformers/:$PYTHONPATH
python /path/to/mindformers/toolkit/data_preprocess/huggingface/datasets_preprocess.py \
    --config ./pcl_reasoner_v1/config/data_process_packing.yaml \
    --save_path /path/to/packed_data/ \
    --register_path ./pcl_reasoner_v1/

# Parameter Description
# config: Path to the data packing config file
# save_path: Output directory for packed data
# register_path: Registration path for the custom handler
```
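Sequence packing places several filtered samples into one fixed-length 32K sequence to avoid padding waste. As a rough illustration of the idea (not the MindFormers implementation), a greedy first-fit packer over sample lengths looks like this:

```python
SEQ_LEN = 32768  # target packed sequence length

def pack_samples(lengths, seq_len=SEQ_LEN):
    # Greedy first-fit decreasing: put each sample into the first pack
    # with room, opening a new pack when none fits. A real packer also
    # builds a block-diagonal attention mask (cf. compress_mask) so
    # samples in the same pack cannot attend to each other.
    packs = []  # each pack is a list of sample lengths
    for n in sorted(lengths, reverse=True):
        for p in packs:
            if sum(p) + n <= seq_len:
                p.append(n)
                break
        else:
            packs.append([n])
    return packs

packs = pack_samples([30000, 2500, 16000, 16768, 12000, 20000])
print(len(packs))             # 3 packed 32K sequences
print([sum(p) for p in packs])
```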

### 3. Training Process

#### 3.1 Weight Preparation

Users can download pre-trained weights from HuggingFace:

| Model Name | Weights URL |
|------------|-------------|
| Qwen2.5-32B-Base | [https://huggingface.co/Qwen/Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B) |

**Note**: MindFormers v1.5.0+ supports direct loading and saving of `safetensors`-format weights. No conversion to `ckpt` is required; subsequent fine-tuning uses the `safetensors` format.

#### 3.2 Training Configuration

*(Only frequently modified configurations are shown. Full config: `pcl_reasoner_v1/config/finetune_pcl_reasoner_v1_32k.yaml`)*

***Basic Configuration:***

```yaml
run_mode: 'finetune'                        # Training mode: fine-tuning
load_checkpoint: '/path/to/Qwen-32B-base/'  # Weight file path
load_ckpt_format: 'safetensors'             # Weight format
auto_trans_ckpt: True                       # Enable online weight splitting for distributed training
```

***Dataset Configuration:***

```yaml
train_dataset: &train_dataset
  data_loader:
    type: CommonDataLoader
    path: "/path/to/dataset/pack_data_lt_32K_full"  # Packed dataset path
    load_func: 'load_from_disk'                     # Data loading method
    shuffle: True                                   # Enable data shuffling
    packing: pack                                   # Packed data format
    adaptor_config:
      compress_mask: True
    mock_config:
      seq_length: 32768                             # Packed sequence length (32K tokens)
      size: 25909                                   # Dataset size / data-parallel split
```

***Parallelism Configuration:***

```yaml
parallel_config:
  data_parallel: &dp 8      # Data parallelism
  model_parallel: 8         # Model parallelism
  pipeline_stage: 2         # Pipeline parallelism stages
  use_seq_parallel: True    # Enable sequence parallelism
  optimizer_shard: True     # Enable optimizer sharding
  micro_batch_num: 16       # Number of micro-batches
```

> *(Note: This configuration example only lists frequently modified items. Refer to the code repository for complete configurations.)*

#### 3.3 Launching Fine-tuning

Specify the configuration file `pcl_reasoner_v1/config/finetune_pcl_reasoner_v1_32k.yaml` in the launch script `run_pcl_reasoner_v1_finetune.sh`, and modify the cluster parameters according to your hardware environment:

```bash
noderank=$1

bash /path/to/mindformers/scripts/msrun_launcher.sh "run_mindformer.py \
    --config /path/to/finetune_pcl_reasoner_v1_32k.yaml \
    --run_mode finetune" \
    --worker_num 128 \
    --local_worker_num 8 \
    --master_addr XX.XX.XX.XX \
    --master_port XXXX \
    --node_rank $noderank \
    --log_dir /path/to/log \
    --join False \
    --cluster_time_out 1200 \
    > run.log 2>&1

# Parameter Description
# config: Path to the configuration file
# run_mode: Operation mode (pretrain/finetune/inference)
# worker_num: Total number of accelerator cards
# local_worker_num: Number of cards per server
# master_addr: Master node address
# master_port: Master node port
# log_dir: Log directory path
# join: Whether to wait for all workers to exit
# cluster_time_out: Cluster timeout duration
```

Then, launch the fine-tuning task using:

```bash
bash run_pcl_reasoner_v1_finetune.sh 0
```

> *(Note: When launching on multiple nodes, specify `node_rank` on each node (e.g., 0 for the first node).)*

After starting the task, monitor the runtime logs with:

```bash
tail -f /path/to/log/worker_127.log
```

### 4. Evaluation

To ensure the fairness of evaluation results, we adopted the **open-source evaluation code from QwQ** ([QwQ/eval at main · QwenLM/QwQ](https://github.com/QwenLM/QwQ)). Developers can follow the `README.md` in that repository to set up the environment and evaluate models.

#### Evaluation Hyperparameters

The sampling hyperparameters used are listed below:

| Hyperparameter | Value |
|----------------|-------|
| `temperature` | 0.6 |
| `top_k` | 40 |
| `top_p` | 0.95 |
| `max_tokens` | 129,024 |
| `chat_template` | `./pcl_reasoner_v1/eval/am_thinking.jinja` |

#### Evaluation Results on AIME24/25

The table below compares mainstream models on the AIME24 and AIME25 benchmarks. For accuracy, we used the **Avg@32 metric** (averaging 32 sampling attempts per query):

<table>
<tr>
<th>Parameter Size</th>