# training
## Single-node automatic training
scripts/training/node.sh
```
# agent name (the yaml file name)
agent="hydra_pe"
# ignore this
cache="null"
# training parameters
bs=32
lr=0.0002
epoch=20
# navsim has three splits: train, val, test. There are two options here:
# 1. default_training -- train on the train split of navtrain, test on navtest (test split)
# 2. competition_training -- train on the train+val splits of navtrain, test on navtest (test split)
# For the small model (resnet34) in the first hydramdp table, I used default_training throughout.
# For the large models (vov, vitl, ...) in the second table, I used competition_training throughout.
config="competition_training"
# all ckpts and tensorboard logs are saved here
# the full path is /zhenxinl_nuplan/navsim_workspace/exp/$dir
dir=${agent}_lr2_ckpt
```
## Multi-node automatic training
```
agent="hydra_pe"
bs=8
lr=0.0002
cache="null"
config="competition_training"
epoch=10
# the one addition compared with the single-node config; each replica has 8 GPUs
# bs above is the per-GPU batch size; the total batch size is bs*replicas
# if you change the number of replicas, scale lr proportionally: total bs x2 means lr x2
replicas=8
```
hydra_offset_vov_fixedpading_modify_head0.01_bs8x8_ckpt
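The proportional lr rule from the config comments can be sketched as a small calculation. This is only an illustration: `scaled_lr` is a hypothetical helper, not part of the training scripts, and it takes lr=0.0002 at replicas=8 from the config above as the reference point.

```shell
# Linear scaling sketch: lr changes by the same factor as the replica count.
# scaled_lr is a hypothetical helper name, not something the repo provides.
scaled_lr() {
    # $1 = reference lr, $2 = reference replicas, $3 = new replicas
    awk -v lr="$1" -v old="$2" -v new="$3" \
        'BEGIN { printf "%g\n", lr * new / old }'
}

# Reference point from the config above: lr=0.0002 at replicas=8.
scaled_lr 0.0002 8 16   # twice the replicas, twice the lr
scaled_lr 0.0002 8 4    # half the replicas, half the lr
```

The printed value would then be copied into `lr=` alongside the new `replicas=` value.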
## Downloading tensorboard files
1. Log in to an NGC machine; one started with sleep, node, or nodes will do
2. cd /zhenxinl_nuplan/navsim_workspace/exp/$dir
3. find . -name 'event*'
4. This may list many event* files; use ls -l to check which one is the largest
5. On the jump host, open a new terminal (ctrl+` in VS Code) and cd to the folder where you want to save the tensorboard file
6. ngc workspace download --file ./navsim_workspace/exp/<path to event file> q-2TlPKESo62ktTxOc8rYg
7. The tensorboard file is now on the jump host
8. In VS Code you can open tensorboard directly with ctrl+shift+p
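Steps 3 and 4 (list the event files, then pick the largest) can be combined into one command. A minimal sketch, assuming the file names contain no newlines; `largest_event_file` is a hypothetical helper name.

```shell
# Print the largest file whose name matches event* under a directory.
# largest_event_file is a hypothetical helper, not part of the repo.
largest_event_file() {
    root="${1:-.}"
    # ls -S sorts its arguments by size, largest first; head keeps the top one.
    find "$root" -type f -name 'event*' -exec ls -S {} + | head -n 1
}

# Run from inside the experiment folder (step 2); prints nothing if no match.
largest_event_file .
```

The printed path is the one to pass to `--file` in step 6, relative to the workspace root.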
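## eval
1. Start an NGC machine with sleep and enter it with ngcexe
2. Run tmux so a dropped connection does not kill your session; when you re-enter the NGC machine, tmux attach -t 0 brings you back to this terminal
3. This step renames all the assorted ckpt files in your folder to a uniform epoch05.ckpt, ...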
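```
# assumes $agent_ckpt (the experiment folder name) is already set
cd "${NAVSIM_EXP_ROOT}/${agent_ckpt}";
for file in epoch=*-step=*.ckpt; do
    # pull the two-digit epoch number out of the file name
    epoch=$(echo "$file" | sed -n 's/.*epoch=\([0-9][0-9]\).*/\1/p')
    # skip names without a two-digit epoch so nothing is renamed to "epoch.ckpt"
    [ -n "$epoch" ] || continue
    new_filename="epoch${epoch}.ckpt"
    mv "$file" "$new_filename"
done
cd /navsim_ours;
```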
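Since the rename is destructive, it may be worth previewing it first. The sketch below applies the same sed pattern as the rename step but only prints what would happen; `preview_ckpt_renames` is a hypothetical helper name.

```shell
# Dry run of the checkpoint rename: prints "old -> new" and moves nothing.
# preview_ckpt_renames is a hypothetical helper, not part of the repo.
preview_ckpt_renames() {
    dir="${1:-.}"
    for file in "$dir"/epoch=*-step=*.ckpt; do
        [ -e "$file" ] || continue   # the glob matched nothing
        base=$(basename "$file")
        epoch=$(printf '%s\n' "$base" | sed -n 's/.*epoch=\([0-9][0-9]\).*/\1/p')
        if [ -z "$epoch" ]; then
            echo "SKIP $base (no two-digit epoch)"
        else
            echo "$base -> epoch${epoch}.ckpt"
        fi
    done
}

# Preview the current directory; prints nothing if there are no such files.
preview_ckpt_renames .
```

Any line starting with SKIP points at a file whose name the sed pattern does not cover, for example a single-digit `epoch=5`.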
4. The step below runs eval on epoch00 through epoch09. If that is too slow, start a second machine and have one handle 00 to 04 and the other 05 to 09.
```
epochs=(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19);
ckpts=(
epoch00.ckpt epoch01.ckpt epoch02.ckpt epoch03.ckpt epoch04.ckpt epoch05.ckpt epoch06.ckpt epoch07.ckpt epoch08.ckpt epoch09.ckpt
epoch10.ckpt epoch11.ckpt epoch12.ckpt epoch13.ckpt epoch14.ckpt epoch15.ckpt epoch16.ckpt epoch17.ckpt epoch18.ckpt epoch19.ckpt
)
# the loop covers indices 0..9; change the range if you trained more epochs
# assumes $agent and $agent_ckpt are already set
for i in {0..9}; do
    python ${NAVSIM_DEVKIT_ROOT}/navsim/planning/script/run_pdm_score_gpu.py \
    +use_pdm_closed=false \
    agent=$agent \
    dataloader.params.batch_size=8 \
    worker.threads_per_node=64 \
    agent.checkpoint_path=${NAVSIM_EXP_ROOT}/${agent_ckpt}/${ckpts[$i]} \
    experiment_name=${agent_ckpt}/${epochs[$i]}_xformers \
    +cache_path=null \
    metric_cache_path=${NAVSIM_EXP_ROOT}/navtest_cache \
    split=test \
    scene_filter=navtest;
done
```
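To split the eval across two machines as suggested in step 4, each machine only needs a different start and end index. A minimal sketch of the idea; `ckpts_for_range` is a hypothetical helper name and it only prints the checkpoint names each machine would handle.

```shell
# Print the checkpoint file names for an inclusive epoch range.
# ckpts_for_range is a hypothetical helper; pass plain integers (0, not 00).
ckpts_for_range() {
    i="$1"
    end="$2"
    while [ "$i" -le "$end" ]; do
        printf 'epoch%02d.ckpt\n' "$i"
        i=$((i + 1))
    done
}

ckpts_for_range 0 4   # machine A
ckpts_for_range 5 9   # machine B
```

In the eval loop itself, the same split amounts to changing `{0..9}` to `{0..4}` on one machine and `{5..9}` on the other.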
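5. After the eval above, the folder will look like this:

xx_xformers holds your eval scores; for inference, the weights at line 340 of hydra_model_pe were used for a first pass.
To view these initial scores you can use the following; I usually use it to pick the best epoch: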
```
for epoch in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19; do
    echo ===================${epoch}===================
    # "end" is not a real file: cat reports an error for it on stderr, but it
    # also keeps cat from blocking on stdin when find returns no CSV
    cat $(find ./${epoch}_xformers/ -type f -name "*.csv") "end" | tail -n 1
done
```
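The loop above prints the last CSV row per epoch for manual comparison. If the overall score sits in the last comma-separated field of that row, picking the best epoch can be automated. That column layout is an assumption, not something this document confirms, so check one CSV by hand first; `best_epoch` is a hypothetical helper name.

```shell
# Print "<score> <epoch>" for the epoch with the highest score.
# ASSUMPTION: the final CSV row is the aggregate and its last field is the score.
# best_epoch is a hypothetical helper, not part of the repo.
best_epoch() {
    root="${1:-.}"
    for dir in "$root"/*_xformers; do
        [ -d "$dir" ] || continue
        epoch=$(basename "$dir" _xformers)
        csv=$(find "$dir" -type f -name '*.csv' | head -n 1)
        [ -n "$csv" ] || continue
        score=$(tail -n 1 "$csv" | awk -F, '{ print $NF }')
        printf '%s %s\n' "$score" "$epoch"
    done | sort -rn | head -n 1
}

# Run from the folder that contains the *_xformers directories.
best_epoch .
```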
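There will also be some epochxx.pkl files; each holds all of the model's sub-scores and is used for grid search.
6. grid search: you can tune the parameters inside the grid search script; once it finishes, just read the results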
```
python ${NAVSIM_DEVKIT_ROOT}/navsim/planning/script/grid_search_unlog.py \
--pkl_path ${NAVSIM_EXP_ROOT}/hydra_pe_vov_bs8x8_ckpt/epoch13.pkl
```