Update HuggingFace upload instructions for split archives
setup_simtoken.md CHANGED (+21 -77)
@@ -57,42 +57,7 @@ PY
 
 ---
 
-## 2.
-
-After migrating the directory with the server platform's migration tool, confirm the key files on the new machine:
-
-```bash
-cd /workspace/SimToken
-
-ls -lh checkpoints/simtoken_pretrained.pth
-ls -lh models/segment_anything/sam_vit_h_4b8939.pth
-ls -d data/image_embed data/gt_mask data/audio_embed data/media
-```
-
-If only the archives are present after migration, without the extracted directories, extract them again:
-
-```bash
-cd /workspace/SimToken/data
-
-tar -xf image_embed.tar
-tar -xzf gt_mask.tar.gz
-tar -xzf audio_embed.tar.gz
-tar -xf media.tar
-```
-
-Clean up caches that the migration does not need:
-
-```bash
-cd /workspace/SimToken
-find . -name "__pycache__" -prune -exec rm -rf {} +
-find . -name ".pytest_cache" -prune -exec rm -rf {} +
-find . -name ".cache" -prune -exec rm -rf {} +
-find . -name "*.pyc" -delete
-```
-
----
-
-## 3. Download from HuggingFace
 
 If the new machine does not use the migration tool and is instead re-initialized from HuggingFace, log in first:
 
@@ -125,7 +90,7 @@ tar -xf media.tar
 
 ---
 
-##
 
 `transformers==4.30.2` may have network/API compatibility issues with newer versions of `huggingface_hub`. It is best to download the models into the local cache with the CLI first, then set `TRANSFORMERS_OFFLINE=1` when running experiments.
 
@@ -148,60 +113,39 @@ TRANSFORMERS_OFFLINE=1 /opt/miniforge3/condabin/conda run -n simtoken \
 
 ---
 
-##
 
-
 
 ```bash
-cd /workspace/SimToken
 
-
-
-
-
-```
 
-
 
-
-
-
 
-
-
-  --name amin_full_e1 \
-  --init_from_saved_model \
-  --epochs 1 \
-  --batch_size 2 \
-  --lr 1e-4 \
-  --saved_model /workspace/SimToken/checkpoints/simtoken_pretrained.pth \
-  --log_root /workspace/SimToken/log \
-  --checkpoint_root /workspace/SimToken/checkpoints
-```
 
-
 
-
-initialized training from saved model: /workspace/SimToken/checkpoints/simtoken_pretrained.pth
-missing keys: ... | unexpected keys: ...
 ```
 
-
-
-## 6. Upload to HuggingFace
-
-After the experiments finish, if you need to re-upload to HuggingFace, first pack the data directories into archives to reduce the file count:
 
 ```bash
 cd /workspace/SimToken/data
-
-tar -
-tar -czf gt_mask.tar.gz gt_mask/
-tar -czf audio_embed.tar.gz audio_embed/
-tar -cf media.tar media/
-
-ls -lh *.tar*
-rm -rf image_embed/ gt_mask/ audio_embed/ media/
 ```
 
 Clean up caches and upload:
@@ -57,42 +57,7 @@ PY
 
 ---
 
+## 2. Download from HuggingFace
 
 If the new machine does not use the migration tool and is instead re-initialized from HuggingFace, log in first:
 
@@ -125,7 +90,7 @@ tar -xf media.tar
 
 ---
 
+## 3. Pre-download Model Weights
 
 `transformers==4.30.2` may have network/API compatibility issues with newer versions of `huggingface_hub`. It is best to download the models into the local cache with the CLI first, then set `TRANSFORMERS_OFFLINE=1` when running experiments.
 
@@ -148,60 +113,39 @@ TRANSFORMERS_OFFLINE=1 /opt/miniforge3/condabin/conda run -n simtoken \
 
 ---
 
+## 4. Upload to HuggingFace
 
+After the experiments finish, if you need to re-upload to HuggingFace, first pack the data directories into archives to reduce the file count:
 
 ```bash
+cd /workspace/SimToken/data
 
+tar -cf image_embed.tar image_embed/
+tar -czf gt_mask.tar.gz gt_mask/
+tar -czf audio_embed.tar.gz audio_embed/
+tar -cf media.tar media/
 
+ls -lh *.tar*
 
+# HuggingFace enforces a hard 50GB limit per file; if image_embed.tar exceeds 50GB,
+# split it into parts smaller than 50GB before uploading.
+split -b 45G -d -a 2 image_embed.tar image_embed.tar.part-
 
+# Verify that the concatenated parts still yield the complete tar file listing.
+cat image_embed.tar.part-* | tar -tf - | grep -v '/$' | wc -l
 
+# Delete the oversized original tar only after the parts pass verification, so the upload does not fail.
+rm -f image_embed.tar
 
+rm -rf image_embed/ gt_mask/ audio_embed/ media/
 ```
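
Review note: the check in this hunk counts entries in the rejoined listing but does not compare them against the original archive. One possible stricter check, sketched below, compares a SHA-256 checksum of the original tar with that of the concatenated parts before the original is deleted. This is only an illustration: it assumes GNU coreutils (`sha256sum`, `split -d`), and the scratch directory, the tiny 4 KiB part size, and the `verify_status` variable are hypothetical stand-ins, not part of the commit.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch directory and tiny sizes, for demonstration only.
workdir="$(mktemp -d)"
cd "$workdir"

# Build a small stand-in for image_embed/ and archive it.
mkdir image_embed
for i in 1 2 3; do
  head -c 4096 /dev/urandom > "image_embed/$i.bin"
done
tar -cf image_embed.tar image_embed/

# Split into 4 KiB parts (the commit uses -b 45G on the real archive).
split -b 4K -d -a 2 image_embed.tar image_embed.tar.part-

# Checksum the original and the rejoined byte stream.
orig_sum="$(sha256sum < image_embed.tar | cut -d' ' -f1)"
parts_sum="$(cat image_embed.tar.part-* | sha256sum | cut -d' ' -f1)"

# Only a byte-identical rejoin counts as verified.
if [ "$orig_sum" = "$parts_sum" ]; then
  verify_status="match"
else
  verify_status="mismatch"
fi
echo "checksum: $verify_status"
```

On the real archive, `rm -f image_embed.tar` would run only when the two checksums match. Hashing a tar of several tens of gigabytes twice takes a while, so the listing count in the commit remains a cheaper first check.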
 
+To restore `image_embed.tar` after downloading:
 
 ```bash
 cd /workspace/SimToken/data
+cat image_embed.tar.part-* > image_embed.tar
+tar -xf image_embed.tar
 ```
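
Review note: rejoining the parts into `image_embed.tar` before extracting temporarily needs disk space for the parts, the rejoined tar, and the extracted directory at once. A possible alternative, sketched here under the assumption that the rejoined tar itself is not needed afterwards, is to stream the parts straight into `tar`. The scratch directory, the sample files, and the `restored_count` variable are hypothetical and exist only to make the example runnable.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch directory with stand-in data, for demonstration only.
workdir="$(mktemp -d)"
cd "$workdir"

mkdir image_embed
printf 'hello\n' > image_embed/a.txt
printf 'world\n' > image_embed/b.txt
tar -cf image_embed.tar image_embed/

# Same naming scheme as the commit, with a tiny part size.
split -b 4K -d -a 2 image_embed.tar image_embed.tar.part-

# Simulate a fresh download: only the parts remain.
rm -rf image_embed image_embed.tar

# Stream the parts into tar; no rejoined image_embed.tar is written to disk.
cat image_embed.tar.part-* | tar -xf -

restored_count="$(find image_embed -type f | wc -l | tr -d ' ')"
echo "restored files: $restored_count"
```

The shell expands `image_embed.tar.part-*` in sorted order, which matches the numeric suffixes produced by `split -d -a 2`, so the parts are concatenated in the order they were written.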
 
 Clean up caches and upload: