Training on ranks other than 16 causes 'size mismatch for transformer' error

#14
by Vyxodez - opened

I've tried training a LoRA with the default parameters (linear and linear_alpha both set to 16), and everything went smoothly. But when I tried training at any other rank, whether lower or higher, I got a size mismatch error:

Traceback (most recent call last):
  File "/workspace/ai-toolkit/run.py", line 90, in <module>
    main()
  File "/workspace/ai-toolkit/run.py", line 86, in main
    raise e
  File "/workspace/ai-toolkit/run.py", line 78, in main
    job.run()
  File "/workspace/ai-toolkit/jobs/ExtensionJob.py", line 22, in run
    process.run()
  File "/workspace/ai-toolkit/jobs/process/BaseSDTrainProcess.py", line 1533, in run
    extra_weights = self.load_weights(latest_save_path)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ai-toolkit/jobs/process/BaseSDTrainProcess.py", line 715, in load_weights
    extra_weights = self.network.load_weights(path)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ai-toolkit/toolkit/network_mixins.py", line 608, in load_weights
    info = self.load_state_dict(load_sd, False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ai-toolkit/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2581, in load_state_dict
    raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for LoRASpecialNetwork:
  size mismatch for transformer$$transformer_blocks$$0$$norm1$$linear.lora_down.weight: copying a param with shape torch.Size([16, 3072]) from checkpoint, the shape in current model is torch.Size([8, 3072]).
  size mismatch for transformer$$transformer_blocks$$0$$norm1$$linear.lora_up.weight: copying a param with shape torch.Size([18432, 16]) from checkpoint, the shape in current model is torch.Size([18432, 8]).
  size mismatch for transformer$$transformer_blocks$$0$$norm1_context$$linear.lora_down.weight: copying a param with shape torch.Size([16, 3072]) from checkpoint, the shape in current model is torch.Size([8, 3072]).
  size mismatch for transformer$$transformer_blocks$$0$$norm1_context$$linear.lora_up.weight: copying a param with shape torch.Size([18432, 16]) from checkpoint, the shape in current model is torch.Size([18432, 8]).
  size mismatch for transformer$$transformer_blocks$$0$$attn$$to_q.lora_down.weight: copying a param with shape torch.Size([16, 3072]) from checkpoint, the shape in current model is torch.Size([8, 3072]).
  ...
and so on.

For anyone who encounters the same issue: I had previously trained a rank-16 LoRA and then tried to train a LoRA at a different rank WITH THE SAME MODEL NAME. Because the names matched, the toolkit apparently tried to resume the previous run from its saved checkpoint and failed with the size mismatch. Changing the name of the model resolved the issue.
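To see why resuming fails, here is a minimal sketch (hypothetical helper names, plain Python instead of torch) of how LoRA adapter shapes depend on the configured rank: the down projection is `[rank, in_features]` and the up projection is `[out_features, rank]`, so a rank-16 checkpoint can never satisfy the strict shape check of a freshly built rank-8 network.

```python
def lora_shapes(rank, in_features=3072, out_features=18432):
    """Shapes of the down/up matrices for one adapted layer (sizes from the traceback)."""
    return {
        "lora_down.weight": (rank, in_features),   # projects input -> rank
        "lora_up.weight": (out_features, rank),    # projects rank -> output
    }

def shape_mismatches(checkpoint, model):
    """Mimic the strict shape comparison load_state_dict performs per key."""
    return [
        (key, ckpt_shape, model[key])
        for key, ckpt_shape in checkpoint.items()
        if ckpt_shape != model[key]
    ]

ckpt = lora_shapes(rank=16)   # saved by the earlier rank-16 run
model = lora_shapes(rank=8)   # freshly built rank-8 network
for key, saved, current in shape_mismatches(ckpt, model):
    print(f"size mismatch for {key}: checkpoint {saved}, current model {current}")
```

In other words, the error is expected behavior for a resume with a changed rank; the fix is to start a fresh run (new name) or delete the old checkpoints so nothing is resumed.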

Vyxodez changed discussion status to closed