/egr/research-optml/wangc168/anaconda3/envs/py310/lib/python3.10/site-packages/transformers/utils/hub.py:127: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
Downloading shards: 100%|██████████| 8/8 [00:00<00:00, 2715.42it/s]
Loading checkpoint shards: 100%|██████████| 8/8 [00:03<00:00,  2.64it/s]
Downloading shards: 100%|██████████| 8/8 [00:00<00:00, 3622.02it/s]
Loading checkpoint shards: 100%|██████████| 8/8 [00:02<00:00,  2.72it/s]
wandb: Currently logged in as: wangc168 (wcs23187) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: setting up run kstsc84q
wandb: Tracking run with wandb version 0.23.1
wandb: Run data is saved locally in /egr/research-optml/wangc168/Muon_wmdp/rmu_wmdp/wandb/run-20260403_161129-kstsc84q
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run muon-v1-rank-models/muon_v1_test_V1_K64_batches150_bs4_alpha1200-1200_steer15-15_muonlr2.6e-3_muonmom0.95_muonwd0.1_adamlr5e-5_seed42_layer7_layers5-6-7_params6
wandb: ⭐️ View project at https://wandb.ai/wcs23187/Muon-constraint
wandb: 🚀 View run at https://wandb.ai/wcs23187/Muon-constraint/runs/kstsc84q
====rmu Config====
model_name_or_path=HuggingFaceH4/zephyr-7b-beta
module_str={model_name}.model.layers[{layer_id}]
output_dir=models/muon_v1_test_V1_K64_batches150_bs4_alpha1200-1200_steer15-15_muonlr2.6e-3_muonmom0.95_muonwd0.1_adamlr5e-5_seed42_layer7_layers5-6-7_params6
retain_corpora=['wikitext', 'wikitext']
forget_corpora=['bio_remove_dataset', 'cyber-forget-corpus']
alpha=[1200.0, 1200.0]
steering_coeffs=15,15
muon_lr=0.0026
muon_momentum=0.95
muon_weight_decay=0.1
muon_ns_steps=5
muon_nesterov=True
adam_emb_lr=5e-05
adam_scalar_lr=5e-05
enable_muon_svd_logging=False
M_stats_path=
M_check_T=0
V1_K=64
min_len=0
max_len=2000
batch_size=4
max_num_batches=150
layer_id=7
layer_ids=[5, 6, 7]
param_ids=[6]
seed=42
verbose=False
steering_coeff_list=[15.0, 15.0]
=====
MUON model.layers.5.mlp.down_proj.weight (4096, 14336)
MUON model.layers.6.mlp.down_proj.weight (4096, 14336)
MUON model.layers.7.mlp.down_proj.weight (4096, 14336)
======= Epoch 0 =======
  0%|          | 0/150 [00:00<?, ?it/s]
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
/egr/research-optml/wangc168/Muon_wmdp/rmu_wmdp/rmu_wb/unlearn_muon_v1.py:201: UserWarning: Using a target size (torch.Size([1, 1, 4096])) that is different to the input size (torch.Size([4, 512, 4096])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  unlearn_loss = torch.nn.functional.mse_loss(
  1%|          | 1/150 [00:04<12:20,  4.97s/it]
/egr/research-optml/wangc168/Muon_wmdp/rmu_wmdp/rmu_wb/unlearn_muon_v1.py:201: UserWarning: Using a target size (torch.Size([1, 1, 4096])) that is different to the input size (torch.Size([4, 768, 4096])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  unlearn_loss = torch.nn.functional.mse_loss(
 41%|████      | 61/150 [08:02<11:52,  8.00s/it]
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1377 | unlearn_loss: 0.1377 | retain_loss: 0 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.0791 | unlearn_loss: 0.0791 | retain_loss: 4.816e-05 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1426 | unlearn_loss: 0.1416 | retain_loss: 0.001022 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07959 | unlearn_loss: 0.0791 | retain_loss: 0.0006828 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1387 | unlearn_loss: 0.1377 | retain_loss: 0.0008354 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07959 | unlearn_loss: 0.07861 | retain_loss: 0.0009842 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1357 | unlearn_loss: 0.1357 | retain_loss: 0.0004139 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07959 | unlearn_loss: 0.07861 | retain_loss: 0.0009918 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1416 | unlearn_loss: 0.1406 | retain_loss: 0.0009003 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.0791 | unlearn_loss: 0.07861 | retain_loss: 0.0004692 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1377 | unlearn_loss: 0.1367 | retain_loss: 0.001373 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.0791 | unlearn_loss: 0.07861 | retain_loss: 0.0004501 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1377 | unlearn_loss: 0.1367 | retain_loss: 0.0005684 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.0791 | unlearn_loss: 0.07861 | retain_loss: 0.0007057 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1436 | unlearn_loss: 0.1396 | retain_loss: 0.003998 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.08008 | unlearn_loss: 0.07861 | retain_loss: 0.001457 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1377 | unlearn_loss: 0.1367 | retain_loss: 0.001244 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.08008 | unlearn_loss: 0.07861 | retain_loss: 0.001556 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1396 | unlearn_loss: 0.1387 | retain_loss: 0.001022 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.0791 | unlearn_loss: 0.07812 | retain_loss: 0.0009079 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1406 | unlearn_loss: 0.1377 | retain_loss: 0.003036 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07959 | unlearn_loss: 0.07812 | retain_loss: 0.001228 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1406 | unlearn_loss: 0.1387 | retain_loss: 0.001526 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.0791 | unlearn_loss: 0.07812 | retain_loss: 0.0009193 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1406 | unlearn_loss: 0.1396 | retain_loss: 0.001152 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.0791 | unlearn_loss: 0.07812 | retain_loss: 0.001007 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1387 | unlearn_loss: 0.1357 | retain_loss: 0.003036 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.08008 | unlearn_loss: 0.07812 | retain_loss: 0.001801 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1396 | unlearn_loss: 0.1367 | retain_loss: 0.002579 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.0791 | unlearn_loss: 0.07764 | retain_loss: 0.001518 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1377 | unlearn_loss: 0.1367 | retain_loss: 0.001404 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07861 | unlearn_loss: 0.07764 | retain_loss: 0.001099 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1387 | unlearn_loss: 0.1357 | retain_loss: 0.002899 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.08008 | unlearn_loss: 0.07812 | retain_loss: 0.002121 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1406 | unlearn_loss: 0.1387 | retain_loss: 0.001907 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07959 | unlearn_loss: 0.07812 | retain_loss: 0.001358 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1416 | unlearn_loss: 0.1367 | retain_loss: 0.004425 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.08057 | unlearn_loss: 0.07812 | retain_loss: 0.002625 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1426 | unlearn_loss: 0.1396 | retain_loss: 0.002457 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07959 | unlearn_loss: 0.07764 | retain_loss: 0.001823 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1416 | unlearn_loss: 0.1367 | retain_loss: 0.004456 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.08008 | unlearn_loss: 0.07715 | retain_loss: 0.002975 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1455 | unlearn_loss: 0.1416 | retain_loss: 0.003616 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07959 | unlearn_loss: 0.07715 | retain_loss: 0.002594 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1416 | unlearn_loss: 0.1387 | retain_loss: 0.002579 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.0791 | unlearn_loss: 0.07715 | retain_loss: 0.002182 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1445 | unlearn_loss: 0.1396 | retain_loss: 0.004578 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.08057 | unlearn_loss: 0.07764 | retain_loss: 0.002914 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1416 | unlearn_loss: 0.1377 | retain_loss: 0.00351 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07959 | unlearn_loss: 0.07715 | retain_loss: 0.002396 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1396 | unlearn_loss: 0.1367 | retain_loss: 0.003021 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07861 | unlearn_loss: 0.07666 | retain_loss: 0.001945 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1377 | unlearn_loss: 0.1348 | retain_loss: 0.002563 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.0791 | unlearn_loss: 0.07715 | retain_loss: 0.001762 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1416 | unlearn_loss: 0.1338 | retain_loss: 0.00824 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.08252 | unlearn_loss: 0.07617 | retain_loss: 0.006256 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1279 | unlearn_loss: 0.1226 | retain_loss: 0.005066 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.08057 | unlearn_loss: 0.07617 | retain_loss: 0.004272 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1406 | unlearn_loss: 0.1309 | retain_loss: 0.009583 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.08447 | unlearn_loss: 0.07617 | retain_loss: 0.008179 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1396 | unlearn_loss: 0.1309 | retain_loss: 0.008667 | param_change: 0
now we use the mode .. v1
 53%|█████▎    | 79/150 [10:26<09:24,  7.95s/it]
/egr/research-optml/wangc168/Muon_wmdp/rmu_wmdp/rmu_wb/unlearn_muon_v1.py:201: UserWarning: Using a target size (torch.Size([1, 1, 4096])) that is different to the input size (torch.Size([4, 679, 4096])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  unlearn_loss = torch.nn.functional.mse_loss(
 81%|████████▏ | 122/150 [16:09<03:43,  7.98s/it]
k_eff 64
k_eff 64
k_eff 64
loss: 0.08398 | unlearn_loss: 0.07666 | retain_loss: 0.007538 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1416 | unlearn_loss: 0.1328 | retain_loss: 0.008362 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.08301 | unlearn_loss: 0.07568 | retain_loss: 0.007507 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1426 | unlearn_loss: 0.1377 | retain_loss: 0.004974 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07959 | unlearn_loss: 0.0752 | retain_loss: 0.004547 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1367 | unlearn_loss: 0.1318 | retain_loss: 0.004517 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07812 | unlearn_loss: 0.07422 | retain_loss: 0.00386 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1318 | unlearn_loss: 0.127 | retain_loss: 0.004547 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07861 | unlearn_loss: 0.07471 | retain_loss: 0.003769 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1309 | unlearn_loss: 0.127 | retain_loss: 0.00415 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07764 | unlearn_loss: 0.07373 | retain_loss: 0.003708 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1289 | unlearn_loss: 0.125 | retain_loss: 0.004395 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07666 | unlearn_loss: 0.07275 | retain_loss: 0.003891 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1357 | unlearn_loss: 0.1279 | retain_loss: 0.007477 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07715 | unlearn_loss: 0.07178 | retain_loss: 0.005463 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1318 | unlearn_loss: 0.126 | retain_loss: 0.005585 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07568 | unlearn_loss: 0.07129 | retain_loss: 0.004608 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1289 | unlearn_loss: 0.1245 | retain_loss: 0.004852 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07764 | unlearn_loss: 0.07324 | retain_loss: 0.004211 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.124 | unlearn_loss: 0.1196 | retain_loss: 0.004608 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07471 | unlearn_loss: 0.07031 | retain_loss: 0.004242 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.126 | unlearn_loss: 0.1196 | retain_loss: 0.006012 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07422 | unlearn_loss: 0.06934 | retain_loss: 0.004913 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1309 | unlearn_loss: 0.127 | retain_loss: 0.004333 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07275 | unlearn_loss: 0.06885 | retain_loss: 0.003906 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1216 | unlearn_loss: 0.1177 | retain_loss: 0.003967 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07129 | unlearn_loss: 0.06787 | retain_loss: 0.003448 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1245 | unlearn_loss: 0.1211 | retain_loss: 0.003632 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07129 | unlearn_loss: 0.06787 | retain_loss: 0.003189 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1162 | unlearn_loss: 0.1128 | retain_loss: 0.003662 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07178 | unlearn_loss: 0.06885 | retain_loss: 0.003113 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.125 | unlearn_loss: 0.1191 | retain_loss: 0.006256 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.0708 | unlearn_loss: 0.06592 | retain_loss: 0.004822 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1172 | unlearn_loss: 0.1113 | retain_loss: 0.005798 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07324 | unlearn_loss: 0.06836 | retain_loss: 0.004913 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.126 | unlearn_loss: 0.1206 | retain_loss: 0.005646 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.07178 | unlearn_loss: 0.06689 | retain_loss: 0.004944 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.124 | unlearn_loss: 0.1196 | retain_loss: 0.004517 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.06787 | unlearn_loss: 0.06396 | retain_loss: 0.00386 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1191 | unlearn_loss: 0.1128 | retain_loss: 0.006439 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.06982 | unlearn_loss: 0.06445 | retain_loss: 0.005249 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1167 | unlearn_loss: 0.1123 | retain_loss: 0.004242 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.06592 | unlearn_loss: 0.0625 | retain_loss: 0.003296 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1104 | unlearn_loss: 0.1055 | retain_loss: 0.0047 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.06689 | unlearn_loss: 0.06299 | retain_loss: 0.003937 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.106 | unlearn_loss: 0.1021 | retain_loss: 0.004059 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.06177 | unlearn_loss: 0.05811 | retain_loss: 0.003555 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1157 | unlearn_loss: 0.1113 | retain_loss: 0.004517 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.06079 | unlearn_loss: 0.05688 | retain_loss: 0.00383 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.105 | unlearn_loss: 0.1011 | retain_loss: 0.00412 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.06055 | unlearn_loss: 0.05688 | retain_loss: 0.00354 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1094 | unlearn_loss: 0.105 | retain_loss: 0.004242 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.06006 | unlearn_loss: 0.05615 | retain_loss: 0.003815 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.104 | unlearn_loss: 0.1001 | retain_loss: 0.003906 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.05859 | unlearn_loss: 0.05518 | retain_loss: 0.003357 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1128 | unlearn_loss: 0.1084 | retain_loss: 0.004456 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.05664 | unlearn_loss: 0.05298 | retain_loss: 0.003769 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1123 | unlearn_loss: 0.1084 | retain_loss: 0.003708 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.05615 | unlearn_loss: 0.05298 | retain_loss: 0.003235 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.124 | unlearn_loss: 0.1191 | retain_loss: 0.004852 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.05591 | unlearn_loss: 0.05176 | retain_loss: 0.004211 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
100%|██████████| 150/150 [19:52<00:00,  7.95s/it]
wandb: updating run metadata
wandb: uploading config.yaml
wandb:
wandb: Run history:
wandb: loss ▄▄▄▇▄██▄▄▄▄▃▄██▄█▄▄█▇▃▇▃▆▃▆▇▆▃▂▅▆▂▂▁▂▅▁▁
wandb: param_change ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: retain_loss ▁▁▁▁▁▂▁▂▂▂▂▂▂▁▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂█
wandb: step ▁▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇███
wandb: unlearn_loss ▃█▃█▃▃██▃█▃▃▃█▃██▃▃▇▇▆▃▃▃▆▃▅▅▂▂▅▅▆▆▄▁▅▄▄
wandb:
wandb: Run summary:
wandb: loss 0.06348
wandb: param_change 0
wandb: retain_loss 0.0249
wandb: step 149
wandb: unlearn_loss 0.03882
wandb:
wandb: 🚀 View run muon-v1-rank-models/muon_v1_test_V1_K64_batches150_bs4_alpha1200-1200_steer15-15_muonlr2.6e-3_muonmom0.95_muonwd0.1_adamlr5e-5_seed42_layer7_layers5-6-7_params6 at: https://wandb.ai/wcs23187/Muon-constraint/runs/kstsc84q
wandb: ⭐️ View project at: https://wandb.ai/wcs23187/Muon-constraint
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20260403_161129-kstsc84q/logs
loss: 0.103 | unlearn_loss: 0.09912 | retain_loss: 0.003784 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.05322 | unlearn_loss: 0.05005 | retain_loss: 0.003296 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1099 | unlearn_loss: 0.1055 | retain_loss: 0.004242 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.05298 | unlearn_loss: 0.04956 | retain_loss: 0.003494 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1064 | unlearn_loss: 0.1025 | retain_loss: 0.003967 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.05298 | unlearn_loss: 0.04956 | retain_loss: 0.003494 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.09619 | unlearn_loss: 0.0918 | retain_loss: 0.004456 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.05127 | unlearn_loss: 0.04736 | retain_loss: 0.003891 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1006 | unlearn_loss: 0.09619 | retain_loss: 0.004608 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.05029 | unlearn_loss: 0.04663 | retain_loss: 0.003723 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.09863 | unlearn_loss: 0.09473 | retain_loss: 0.003815 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.05273 | unlearn_loss: 0.04956 | retain_loss: 0.00325 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.09961 | unlearn_loss: 0.09619 | retain_loss: 0.003571 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.04468 | unlearn_loss: 0.04175 | retain_loss: 0.002975 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.09668 | unlearn_loss: 0.09229 | retain_loss: 0.004425 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.04419 | unlearn_loss: 0.04053 | retain_loss: 0.003708 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1025 | unlearn_loss: 0.09863 | retain_loss: 0.003754 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.04395 | unlearn_loss: 0.04077 | retain_loss: 0.003204 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.08496 | unlearn_loss: 0.08105 | retain_loss: 0.00386 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.04517 | unlearn_loss: 0.04199 | retain_loss: 0.003098 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.09717 | unlearn_loss: 0.09326 | retain_loss: 0.004089 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.04492 | unlearn_loss: 0.0415 | retain_loss: 0.003464 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1064 | unlearn_loss: 0.1011 | retain_loss: 0.005493 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.04321 | unlearn_loss: 0.03882 | retain_loss: 0.004303 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.1064 | unlearn_loss: 0.1006 | retain_loss: 0.005829 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.04443 | unlearn_loss: 0.03955 | retain_loss: 0.004913 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.125 | unlearn_loss: 0.09082 | retain_loss: 0.03394 | param_change: 0
now we use the mode .. v1
k_eff 64
k_eff 64
k_eff 64
loss: 0.06348 | unlearn_loss: 0.03882 | retain_loss: 0.0249 | param_change: 0
Saved model to models/muon_v1_test_V1_K64_batches150_bs4_alpha1200-1200_steer15-15_muonlr2.6e-3_muonmom0.95_muonwd0.1_adamlr5e-5_seed42_layer7_layers5-6-7_params6
Saved model path to /egr/research-optml/wangc168/Muon_wmdp/rmu_wmdp/model_dir.txt with index 298
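
The per-step metric lines above (`loss: ... | unlearn_loss: ... | retain_loss: ... | param_change: ...`) can be pulled out of the log for offline inspection. The following is a minimal sketch, not part of the training script: `parse_metrics` and `split_alternating` are hypothetical helper names, and the even/odd split assumes that consecutive metric lines alternate between the two entries of `forget_corpora`. The alternating loss levels (about 0.14 and about 0.08) suggest this, but the log does not state it.

```python
def parse_metrics(log_text):
    """Return one dict of floats per 'loss: ...' line, in log order."""
    rows = []
    for line in log_text.splitlines():
        line = line.strip()
        # Skip progress bars, warnings, 'k_eff' lines and wandb output.
        if not line.startswith("loss: "):
            continue
        row = {}
        for field in line.split("|"):
            key, _, value = field.partition(":")
            row[key.strip()] = float(value)
        rows.append(row)
    return rows


def split_alternating(rows):
    """Split rows into even- and odd-indexed series.

    Assumption (not confirmed by the log): metric lines alternate
    between the two forget corpora, one line per training step.
    """
    return rows[0::2], rows[1::2]


# Two metric lines copied from the start of the log, with surrounding noise.
sample = """\
now we use the mode .. v1
k_eff 64
loss: 0.1377 | unlearn_loss: 0.1377 | retain_loss: 0 | param_change: 0
now we use the mode .. v1
k_eff 64
loss: 0.0791 | unlearn_loss: 0.0791 | retain_loss: 4.816e-05 | param_change: 0
"""

rows = parse_metrics(sample)
first, second = split_alternating(rows)
print(len(rows), first[0]["unlearn_loss"], second[0]["retain_loss"])
```

Applied to the full log text, the same parser would also make it easy to check that `param_change` is reported as 0 on every step, which is what every metric line in this run shows.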