| Ascend Retool Best Practice |
| =================================== |
|
|
| Last updated: 03/01/2026. |
|
|
| Introduction |
| ---------------------------------- |
|
|
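| The ReTool paper (`ReTool <https://arxiv.org/pdf/2504.11536>`_) integrates a code interpreter tool into reinforcement learning: |
| policy rollouts interleave multi-turn, real-time code execution, and the model learns from the execution feedback when and how to invoke the tool. |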
|
|
| This guide covers two steps: |
|
| 1. Environment setup |
| 2. Model training |
|
|
| The example model and the hardware required by its script are listed below: |
|
|
| =============== ============ ============ ========================== |
| Model           NPU model    Nodes        Training/inference backend |
| =============== ============ ============ ========================== |
| ``Qwen2.5-7B``  Atlas 900 A2 1            ``vllm + FSDP``            |
| =============== ============ ============ ========================== |
|
|
| Environment Setup |
| ----------------------------------- |
| 1. Build from a custom Conda environment |
|
|
| ============ ============================================================ |
| Software     Version |
| ============ ============================================================ |
| Python       ``>= 3.10, < 3.12`` |
| CANN         ``== 8.3.RC1`` |
| torch        ``== 2.7.1`` |
| torch_npu    ``== 2.7.1`` |
| verl         ``v0.6.1 commitId=d62da4950573d7a4b7ef2362337952e7ab59e78d`` |
| vllm         ``v0.11.0`` |
| vllm-ascend  ``v0.11.0-dev`` |
| transformers ``4.57.6`` |
| ============ ============================================================ |
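
The Python constraint in the table can be checked before the remaining packages are installed. The following is a minimal sketch, not part of the original recipe: the helper name ``python_version_ok`` is hypothetical, and it compares only the major and minor version numbers.

.. code-block:: bash

    # Hypothetical helper: succeed if a "major.minor[.patch]" version string
    # satisfies the Python constraint from the table (>= 3.10, < 3.12).
    python_version_ok() {
        local major minor
        IFS=. read -r major minor _ <<< "$1"
        [ "$major" -eq 3 ] && [ "$minor" -ge 10 ] && [ "$minor" -lt 12 ]
    }

    # Check the interpreter on PATH, if there is one.
    if command -v python3 >/dev/null 2>&1; then
        current=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
        if python_version_ok "$current"; then
            echo "Python $current is within the supported range."
        else
            echo "Python $current is outside the supported range (>= 3.10, < 3.12)."
        fi
    fi

The same pattern could be extended to the other rows, but the pinned packages are probably easier to verify with ``pip list`` inside the Conda environment.
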
|
|
| Model Training and Evaluation |
| ----------------------------------- |
| 1. Model and data preparation |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| ``Qwen2.5-7B`` |
| ^^^^^^^^^^^^^^ |
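| **Download the model weights** |
|
|
| ``--local-dir``: path where the model is saved. This option belongs to ``huggingface-cli download``; the ``git clone`` command below instead creates a ``Qwen2.5-7B-Instruct`` directory under the current working directory (Git LFS must be installed for the weight files to be fetched). |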
|
|
| .. code-block:: bash |
|
|
| git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct |
|
|
| **Download the training dataset** |
|
|
| .. code-block:: bash |
|
|
| git clone https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k |
|
|
| **Download the evaluation dataset** |
|
|
| .. code-block:: bash |
|
|
| git clone https://huggingface.co/datasets/Maxwell-Jia/AIME_2024 |
|
|
| **Download the pre-training (SFT) dataset** |
|
|
| .. code-block:: bash |
|
|
| python3 recipe/retool/retool_sft_preprocess.py |
|
|
| *Note: the script downloads ReTool-SFT automatically; the generated data is saved under ~/ReTool-SFT/data by default.* |
|
|
| **Run the pre-training script** |
|
|
| .. code-block:: bash |
|
|
| bash recipe/retool/run_qwen2_7b_sft_npu.sh # adjust the paths inside the script first |
|
|
| **Merge the pre-trained weights into a checkpoint** |
|
|
| .. code-block:: bash |
|
|
| python3 -m verl.model_merger merge --backend fsdp \ |
| --local_dir ${DATASETS}/checkpoint/multiturn-sft-qwen-2.5-7b-instruct/global_step_372 \ |
| --target_dir ${DATASETS}/checkpoint/multiturn-sft-qwen-2.5-7b-instruct/global_step_372/huggingface |
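
Before training starts from the merged weights, it may be worth confirming that the target directory is populated. The sketch below assumes the merger writes a standard Hugging Face layout (a ``config.json`` plus one or more ``*.safetensors`` or ``*.bin`` weight files); that layout and the helper name ``looks_like_hf_checkpoint`` are assumptions, not something this guide specifies.

.. code-block:: bash

    # Hypothetical helper: succeed if the directory looks like a Hugging Face
    # checkpoint, i.e. it has config.json and at least one weight file.
    looks_like_hf_checkpoint() {
        local dir=$1
        [ -f "$dir/config.json" ] || return 1
        # compgen -G succeeds only if the glob matches at least one path.
        compgen -G "$dir/*.safetensors" >/dev/null || compgen -G "$dir/*.bin" >/dev/null
    }

    # DATASETS is the same variable used in the merge command.
    target_dir="${DATASETS:-.}/checkpoint/multiturn-sft-qwen-2.5-7b-instruct/global_step_372/huggingface"
    if looks_like_hf_checkpoint "$target_dir"; then
        echo "Merged checkpoint found in $target_dir"
    else
        echo "No complete checkpoint in $target_dir; check the merge output." >&2
    fi
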
|
|
| 2. Code sandbox preparation |
|
|
| For the open-source sandbox code and its deployment instructions, see |
| https://github.com/bytedance/SandboxFusion |
|
|
| **Download the sandbox code** |
|
|
| .. code-block:: bash |
|
|
| git clone -b main https://github.com/bytedance/SandboxFusion.git |
|
|
| **Install the sandbox** |
|
|
| .. code-block:: bash |
|
|
| cd SandboxFusion |
| conda create -n sandbox -y python=3.11 |
| conda activate sandbox |
| pip install poetry |
| poetry lock |
| poetry install |
| mkdir -p docs/build |
| cd runtime/python |
| bash install-python-runtime.sh |
| cd ../../ |
| make run-online |
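
Once ``make run-online`` is running, a quick request from a second terminal can confirm that the service accepts code. This is a sketch under assumptions: the address ``http://localhost:8080/run_code`` and the request fields ``code`` and ``language`` follow the SandboxFusion documentation as best recalled and should be checked against the deployed version; the helper names are hypothetical.

.. code-block:: bash

    # Assumption: the sandbox listens on localhost:8080 and exposes /run_code.
    SANDBOX_URL=${SANDBOX_URL:-"http://localhost:8080/run_code"}

    # Build the JSON request body for a small Python snippet.
    # The snippet must not contain double quotes or backslashes,
    # because no JSON escaping is performed here.
    sandbox_payload() {
        printf '{"code": "%s", "language": "python"}' "$1"
    }

    # Send the request; curl exits non-zero if the server is unreachable
    # or answers with an HTTP error status.
    sandbox_smoke_test() {
        curl --silent --show-error --fail \
            -H 'Content-Type: application/json' \
            --data "$(sandbox_payload 'print(1 + 1)')" \
            "$SANDBOX_URL"
    }

Calling ``sandbox_smoke_test`` prints the raw response from the service; a non-zero exit status suggests the sandbox is not reachable at ``SANDBOX_URL``.
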
|
|
| 3. Training |
|
|
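| An example configuration is shown below. Create ``run_qwen2.5_7b_dapo_npu.sh`` in the ``recipe/retool`` directory, |
| then adjust the path-related parameters in the script (for example ``DATA_ROOT``, ``model_path``, and ``tool_config_path``) to match your environment. |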
|
|
| .. code-block:: bash |
|
|
| set -x |
|
|
| export VLLM_USE_V1=1 |
| export TORCHDYNAMO_DISABLE=1 |
| export VLLM_ASCEND_ENABLE_NZ=0 |
| export TASK_QUEUE_ENABLE=1 |
| export VLLM_ENABLE_GRAPH_MODE=1 |
| export HCCL_OP_EXPANSION_MODE="AIV" |
| export VLLM_ASCEND_ENABLE_MLP_OPTIMIZE=1 |
| export LD_PRELOAD=/usr/local/lib/libjemalloc.so.2 |
| |
| # ================= data/model/tool ================= |
| HDFS_ROOT=${HDFS_ROOT:-"${PWD}"} |
| DATA_ROOT=${DATA_ROOT:-"${PWD}"} |
| |
| dapo_math_17k=$DATA_ROOT/dataset/BytedTsinghua-SIA/DAPO-Math-17k |
| aime_2024=$DATA_ROOT/dataset/Maxwell-Jia/AIME_2024 |
| #aime_2025=$DATA_ROOT/dataset/yentinglin/aime_2025 |
| model_path=$DATA_ROOT/dataset/checkpoint/multiturn-sft-qwen-2.5-7b-instruct/global_step_372/huggingface |
| |
| train_files="['$dapo_math_17k']" |
| test_files="['$aime_2024']" |
| |
| # tool |
| tool_config_path=recipe/retool/sandbox_fusion_tool_config.yaml |
| |
| # wandb |
| project_name=retool |
| experiment_name=qwen2.5-7b_dapo |
| default_local_dir=$DATA_ROOT/checkpoint/$experiment_name |
| |
| # Create the log file |
| export TIMESTAMP=$(date +%Y%m%d_%H%M%S) |
| LOG_DIR="$HDFS_ROOT/verl/logs/$project_name/$experiment_name" |
| # Check whether the log directory exists |
| if [ ! -d "$LOG_DIR" ]; then |
| # The directory does not exist, so create it |
| mkdir -p "$LOG_DIR" |
| echo "Directory $LOG_DIR created." |
| else |
| echo "Directory $LOG_DIR already exists." |
| fi |
| |
| LOG_FILE="${LOG_DIR}/${TIMESTAMP}.log" |
| touch "$LOG_FILE" |
| echo "Log file $LOG_FILE created." |
|
|
| # ================= algorithm ================= |
| adv_estimator=grpo |
| |
| use_kl_in_reward=False |
| kl_coef=0.0 |
| use_kl_loss=False |
| kl_loss_coef=0.0 |
| |
| clip_ratio_low=0.2 |
| clip_ratio_high=0.28 |
| |
| max_turns=16 |
| max_prompt_length=2048 |
| max_response_length=20480 |
| actor_lr=1e-6 |
| |
| train_batch_size=32 |
| ppo_mini_batch_size=16 |
| |
| n_resp_per_prompt=16 |
| n_resp_per_prompt_val=30 |
| |
| # ================= performance ================= |
| infer_tp=2 # vllm |
| train_sp=4 # train |
| offload=True |
| |
| actor_max_token_len_per_gpu=$(( (max_prompt_length + max_response_length) * 1 )) |
| log_prob_max_token_len_per_gpu=$(( actor_max_token_len_per_gpu * 4 )) |
|
|
| PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \ |
| algorithm.adv_estimator=$adv_estimator \ |
| algorithm.use_kl_in_reward=$use_kl_in_reward \ |
| algorithm.kl_ctrl.kl_coef=$kl_coef \ |
| data.train_files="$train_files" \ |
| data.val_files="$test_files" \ |
| data.return_raw_chat=True \ |
| data.train_batch_size=$train_batch_size \ |
| data.max_prompt_length=$max_prompt_length \ |
| data.max_response_length=$max_response_length \ |
| data.filter_overlong_prompts=True \ |
| data.truncation='error' \ |
| data.custom_cls.path=recipe/retool/retool.py \ |
| data.custom_cls.name=CustomRLHFDataset \ |
| custom_reward_function.path=recipe/retool/retool.py \ |
| custom_reward_function.name=compute_score \ |
| actor_rollout_ref.model.path=$model_path \ |
| actor_rollout_ref.model.use_remove_padding=True \ |
| actor_rollout_ref.model.enable_gradient_checkpointing=True \ |
| actor_rollout_ref.actor.use_kl_loss=$use_kl_loss \ |
| actor_rollout_ref.actor.kl_loss_coef=$kl_loss_coef \ |
| actor_rollout_ref.actor.clip_ratio_low=$clip_ratio_low \ |
| actor_rollout_ref.actor.clip_ratio_high=$clip_ratio_high \ |
| actor_rollout_ref.actor.clip_ratio_c=10.0 \ |
| actor_rollout_ref.actor.optim.lr=$actor_lr \ |
| actor_rollout_ref.actor.use_dynamic_bsz=True \ |
| actor_rollout_ref.actor.ppo_mini_batch_size=$ppo_mini_batch_size \ |
| actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$actor_max_token_len_per_gpu \ |
| actor_rollout_ref.actor.ulysses_sequence_parallel_size=$train_sp \ |
| actor_rollout_ref.actor.fsdp_config.param_offload=$offload \ |
| actor_rollout_ref.actor.fsdp_config.optimizer_offload=$offload \ |
| actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=$log_prob_max_token_len_per_gpu \ |
| actor_rollout_ref.rollout.max_num_batched_tokens=$actor_max_token_len_per_gpu \ |
| actor_rollout_ref.rollout.name=vllm \ |
| actor_rollout_ref.rollout.mode=async \ |
| actor_rollout_ref.rollout.max_num_seqs=1024 \ |
| actor_rollout_ref.rollout.tensor_model_parallel_size=$infer_tp \ |
| actor_rollout_ref.rollout.multi_turn.enable=True \ |
| actor_rollout_ref.rollout.multi_turn.max_user_turns=$max_turns \ |
| actor_rollout_ref.rollout.multi_turn.max_assistant_turns=$max_turns \ |
| actor_rollout_ref.rollout.multi_turn.tool_config_path=$tool_config_path \ |
| actor_rollout_ref.rollout.multi_turn.format=hermes \ |
| actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \ |
| actor_rollout_ref.rollout.n=$n_resp_per_prompt \ |
| actor_rollout_ref.rollout.val_kwargs.top_p=0.6 \ |
| actor_rollout_ref.rollout.val_kwargs.temperature=1.0 \ |
| actor_rollout_ref.rollout.val_kwargs.n=$n_resp_per_prompt_val \ |
| actor_rollout_ref.rollout.enable_chunked_prefill=True \ |
| actor_rollout_ref.rollout.enforce_eager=False \ |
| trainer.logger=['console'] \ |
| trainer.project_name=$project_name \ |
| trainer.experiment_name=$experiment_name \ |
| trainer.n_gpus_per_node=8 \ |
| trainer.val_before_train=False \ |
| trainer.log_val_generations=20 \ |
| trainer.nnodes=1 \ |
| trainer.save_freq=100 \ |
| trainer.default_local_dir=$default_local_dir \ |
| trainer.test_freq=20 \ |
| trainer.device=npu \ |
| actor_rollout_ref.actor.entropy_from_logits_with_chunking=True \ |
| actor_rollout_ref.ref.entropy_from_logits_with_chunking=True \ |
| actor_rollout_ref.actor.use_torch_compile=False \ |
| actor_rollout_ref.ref.use_torch_compile=False \ |
| actor_rollout_ref.actor.entropy_checkpointing=True \ |
| actor_rollout_ref.ref.entropy_checkpointing=True \ |
| trainer.total_epochs=1 "$@" > "$LOG_FILE" 2>&1 & |
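
The derived values in the script follow directly from its length and batch settings. As a quick check, the sketch below repeats the arithmetic with the same numbers. ``rollouts_per_step`` is an illustrative name that does not appear in the script; it assumes each training step samples ``n_resp_per_prompt`` responses for every prompt in the batch.

.. code-block:: bash

    # Values copied from the training script.
    max_prompt_length=2048
    max_response_length=20480
    train_batch_size=32
    n_resp_per_prompt=16

    # Token budgets, computed exactly as in the script.
    actor_max_token_len_per_gpu=$(( (max_prompt_length + max_response_length) * 1 ))
    log_prob_max_token_len_per_gpu=$(( actor_max_token_len_per_gpu * 4 ))

    # Illustrative only: trajectories sampled per training step.
    rollouts_per_step=$(( train_batch_size * n_resp_per_prompt ))

    echo "actor_max_token_len_per_gpu=$actor_max_token_len_per_gpu"        # 22528
    echo "log_prob_max_token_len_per_gpu=$log_prob_max_token_len_per_gpu"  # 90112
    echo "rollouts_per_step=$rollouts_per_step"                            # 512

If ``max_response_length`` is changed, both token budgets scale with it, so the memory-related settings may need to be revisited together.
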
|
|
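
The training script starts the job in the background and redirects all output to a timestamped file under ``$LOG_DIR``, so nothing is printed to the terminal. A small helper can locate the most recent log; this is a sketch, with ``latest_log`` as a hypothetical name and the directory built from the script's default ``project_name`` and ``experiment_name``.

.. code-block:: bash

    # Same defaults as the training script.
    HDFS_ROOT=${HDFS_ROOT:-"${PWD}"}
    LOG_DIR="$HDFS_ROOT/verl/logs/retool/qwen2.5-7b_dapo"

    # Hypothetical helper: print the newest log file in a directory.
    # File names are timestamps (YYYYmmdd_HHMMSS.log), so the
    # lexicographically last name is also the most recent one.
    latest_log() {
        local newest="" f
        for f in "$1"/*.log; do
            if [ -e "$f" ]; then
                newest=$f
            fi
        done
        [ -n "$newest" ] && printf '%s\n' "$newest"
    }

    # Report the most recent run, if any log exists yet.
    if log_file=$(latest_log "$LOG_DIR"); then
        echo "Latest log: $log_file"
        # tail -f "$log_file"   # uncomment to follow the run
    fi
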