File size: 8,930 Bytes
1faccd4 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 | Ascend Retool Best Practice
===================================
Last updated: 03/01/2026.
引言
----------------------------------
Retool论文参考([Retool](https://arxiv.org/pdf/2504.11536))
集成代码解释器工具,通过多轮实时代码执行进行策略部署,并教会模型根据结果反馈学习何时以及如何调用工具。
1. 环境构建
2. 模型训练
用例模型脚本以及其需要的硬件条件各自如下:
=============== ============ ============ ===============
模型 NPU型号 节点数量 训推后端
=============== ============ ============ ===============
``Qwen2.5-7B`` Atlas 900 A2 1 ``vllm + FSDP``
=============== ============ ============ ===============
环境构建
-----------------------------------
1.从自定义Conda环境进行构建
============ ============================================================
software version
============ ============================================================
Python ``>= 3.10, <3.12``
CANN ``== 8.3.RC1``
torch ``== 2.7.1``
torch_npu ``== 2.7.1``
verl ``v0.6.1 commitId=d62da4950573d7a4b7ef2362337952e7ab59e78d``
vllm ``v0.11.0``
vllm-ascend ``v0.11.0-dev``
transformers ``4.57.6``
============ ============================================================
模型训练与评估
-----------------------------------
1.模型数据准备
^^^^^^^^^^^
`Qwen2.5-7B`
^^^^^^^^^^^
**下载模型权重**
--local-dir: 模型保存路径
.. code-block:: bash
git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
**下载训练数据集**
.. code-block:: bash
git clone https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k
**下载评估数据集**
.. code-block:: bash
git clone https://huggingface.co/datasets/Maxwell-Jia/AIME_2024
**下载预训练数据集**
.. code-block:: bash
python3 recipe/retool/retool_sft_preprocess.py
*注:自动下载ReTool-SFT,最后生成数据默认保存在~/ReTool-SFT/data目录下*
**执行预训练脚本**
.. code-block:: bash
bash recipe/retool/run_qwen2_7b_sft_npu.sh # 需适配脚本中路径
**合并预训练权重生成checkpoint**
.. code-block:: bash
python3 -m verl.model_merger merge --backend fsdp \
--local_dir ${DATASETS}/checkpoint/multiturn-sft-qwen-2.5-7b-instruct/global_step_372 \
--target_dir ${DATASETS}/checkpoint/multiturn-sft-qwen-2.5-7b-instruct/global_step_372/huggingface
2.代码沙箱准备
开源沙箱代码及部署参考
https://github.com/bytedance/SandboxFusion
**沙箱代码下载**
.. code-block:: bash
git clone -b main https://github.com/bytedance/SandboxFusion.git
**沙箱安装**
.. code-block:: bash
cd SandboxFusion
conda create -n sandbox -y python=3.11
conda activate sandbox
pip install poetry
poetry lock
poetry install
mkdir -p docs/build
cd runtime/python
bash install-python-runtime.sh
cd ../../
make run-online
3.训练
示例配置文件如下,在recipe/retool目录下创建一个run_qwen2.5_7b_dapo_npu.sh
根据开发者实际路径配置情况修改模型训练脚本中的以下参数
.. code-block:: bash
set -x
export VLLM_USE_V1=1
export TORCHDYNAMO_DISABLE=1
export VLLM_ASCEND_ENABLE_NZ=0
export TASK_QUEUE_ENABLE=1
export VLLM_ENABLE_GRAPH_MODE=1
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_ASCEND_ENABLE_MLP_OPTIMIZE=1
export LD_PRELOAD=/usr/local/lib/libjemalloc.so.2
# ================= data/model/tool =================
HDFS_ROOT=${HDFS_ROOT:-"${PWD}"}
DATA_ROOT=${DATA_ROOT:-"${PWD}"}
dapo_math_17k=$DATA_ROOT/dataset/BytedTsinghua-SIA/DAPO-Math-17k
aime_2024=$DATA_ROOT/dataset/Maxwell-Jia/AIME_2024
#aime_2025=$DATA_ROOT/dataset/yentinglin/aime_2025
model_path=$DATA_ROOT/dataset/checkpoint/multiturn-sft-qwen-2.5-7b-instruct/global_step_372/huggingface
train_files="['$dapo_math_17k']"
test_files="['$aime_2024']"
# tool
tool_config_path=recipe/retool/sandbox_fusion_tool_config.yaml
# wandb
project_name=retool
experiment_name=qwen2.5-7b_dapo
default_local_dir=$DATA_ROOT/checkpoint/$experiment_name
# 创建日志文件
export TIMESTAMP=$(date +%Y%m%d_%H%M%S)
LOG_DIR="$HDFS_ROOT/verl/logs/$project_name/$experiment_name"
# 判断路径是否存在
if [ ! -d "$LOG_DIR" ]; then
# 路径不存在,创建路径
mkdir -p "$LOG_DIR"
echo "Directory $LOG_DIR created."
else
echo "Directory $LOG_DIR already exists."
fi
LOG_FILE="${LOG_DIR}/${TIMESTAMP}.log"
touch "$LOG_FILE"
echo "Log file $LOG_FILE created."
# ================= algorithm =================
adv_estimator=grpo
use_kl_in_reward=False
kl_coef=0.0
use_kl_loss=False
kl_loss_coef=0.0
clip_ratio_low=0.2
clip_ratio_high=0.28
max_turns=16
max_prompt_length=2048
max_response_length=20480
actor_lr=1e-6
train_batch_size=32
ppo_mini_batch_size=16
n_resp_per_prompt=16
n_resp_per_prompt_val=30
# ================= performance =================
infer_tp=2 # vllm
train_sp=4 # train
offload=True
actor_max_token_len_per_gpu=$(( (max_prompt_length + max_response_length) * 1 ))
log_prob_max_token_len_per_gpu=$(( actor_max_token_len_per_gpu * 4 ))
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=$adv_estimator \
algorithm.use_kl_in_reward=$use_kl_in_reward \
algorithm.kl_ctrl.kl_coef=$kl_coef \
data.train_files="$train_files" \
data.val_files="$test_files" \
data.return_raw_chat=True \
data.train_batch_size=$train_batch_size \
data.max_prompt_length=$max_prompt_length \
data.max_response_length=$max_response_length \
data.filter_overlong_prompts=True \
data.truncation='error' \
data.custom_cls.path=recipe/retool/retool.py \
data.custom_cls.name=CustomRLHFDataset \
custom_reward_function.path=recipe/retool/retool.py \
custom_reward_function.name=compute_score \
actor_rollout_ref.model.path=$model_path \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.use_kl_loss=$use_kl_loss \
actor_rollout_ref.actor.kl_loss_coef=$kl_loss_coef \
actor_rollout_ref.actor.clip_ratio_low=$clip_ratio_low \
actor_rollout_ref.actor.clip_ratio_high=$clip_ratio_high \
actor_rollout_ref.actor.clip_ratio_c=10.0 \
actor_rollout_ref.actor.optim.lr=$actor_lr \
actor_rollout_ref.actor.use_dynamic_bsz=True \
actor_rollout_ref.actor.ppo_mini_batch_size=$ppo_mini_batch_size \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$actor_max_token_len_per_gpu \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=$train_sp \
actor_rollout_ref.actor.fsdp_config.param_offload=$offload \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=$offload \
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=$log_prob_max_token_len_per_gpu \
actor_rollout_ref.rollout.max_num_batched_tokens=$actor_max_token_len_per_gpu \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.mode=async \
actor_rollout_ref.rollout.max_num_seqs=1024 \
actor_rollout_ref.rollout.tensor_model_parallel_size=$infer_tp \
actor_rollout_ref.rollout.multi_turn.enable=True \
actor_rollout_ref.rollout.multi_turn.max_user_turns=$max_turns \
actor_rollout_ref.rollout.multi_turn.max_assistant_turns=$max_turns \
actor_rollout_ref.rollout.multi_turn.tool_config_path=$tool_config_path \
actor_rollout_ref.rollout.multi_turn.format=hermes \
actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \
actor_rollout_ref.rollout.n=$n_resp_per_prompt \
actor_rollout_ref.rollout.val_kwargs.top_p=0.6 \
actor_rollout_ref.rollout.val_kwargs.temperature=1.0 \
actor_rollout_ref.rollout.val_kwargs.n=$n_resp_per_prompt_val \
actor_rollout_ref.rollout.enable_chunked_prefill=True \
actor_rollout_ref.rollout.enforce_eager=False \
trainer.logger=['console'] \
trainer.project_name=$project_name \
trainer.experiment_name=$experiment_name \
trainer.n_gpus_per_node=8 \
trainer.val_before_train=False \
trainer.log_val_generations=20 \
trainer.nnodes=1 \
trainer.save_freq=100 \
trainer.default_local_dir=$default_local_dir \
trainer.test_freq=20 \
trainer.device=npu \
actor_rollout_ref.actor.entropy_from_logits_with_chunking=True \
actor_rollout_ref.ref.entropy_from_logits_with_chunking=True \
actor_rollout_ref.actor.use_torch_compile=False \
actor_rollout_ref.ref.use_torch_compile=False \
actor_rollout_ref.actor.entropy_checkpointing=True \
actor_rollout_ref.ref.entropy_checkpointing=True \
actor_rollout_ref.ref.use_torch_compile=False \
trainer.total_epochs=1 $@ > $LOG_FILE 2>&1 &
|