Training on ranks other than 16 causes 'size mismatch for transformer' error

#14
by Vyxodez - opened

I've tried training a LoRA with the default parameters (linear and linear_alpha both set to 16), and everything went smoothly. But when I tried training at any other rank, whether lower or higher, I got a size mismatch error:

Traceback (most recent call last):
  File "/workspace/ai-toolkit/run.py", line 90, in <module>
    main()
  File "/workspace/ai-toolkit/run.py", line 86, in main
    raise e
  File "/workspace/ai-toolkit/run.py", line 78, in main
    job.run()
  File "/workspace/ai-toolkit/jobs/ExtensionJob.py", line 22, in run
    process.run()
  File "/workspace/ai-toolkit/jobs/process/BaseSDTrainProcess.py", line 1533, in run
    extra_weights = self.load_weights(latest_save_path)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ai-toolkit/jobs/process/BaseSDTrainProcess.py", line 715, in load_weights
    extra_weights = self.network.load_weights(path)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ai-toolkit/toolkit/network_mixins.py", line 608, in load_weights
    info = self.load_state_dict(load_sd, False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/ai-toolkit/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2581, in load_state_dict
    raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for LoRASpecialNetwork:
  size mismatch for transformer$$transformer_blocks$$0$$norm1$$linear.lora_down.weight: copying a param with shape torch.Size([16, 3072]) from checkpoint, the shape in current model is torch.Size([8, 3072]).
  size mismatch for transformer$$transformer_blocks$$0$$norm1$$linear.lora_up.weight: copying a param with shape torch.Size([18432, 16]) from checkpoint, the shape in current model is torch.Size([18432, 8]).
  size mismatch for transformer$$transformer_blocks$$0$$norm1_context$$linear.lora_down.weight: copying a param with shape torch.Size([16, 3072]) from checkpoint, the shape in current model is torch.Size([8, 3072]).
  size mismatch for transformer$$transformer_blocks$$0$$norm1_context$$linear.lora_up.weight: copying a param with shape torch.Size([18432, 16]) from checkpoint, the shape in current model is torch.Size([18432, 8]).
  size mismatch for transformer$$transformer_blocks$$0$$attn$$to_q.lora_down.weight: copying a param with shape torch.Size([16, 3072]) from checkpoint, the shape in current model is torch.Size([8, 3072]).
  ...
and so on.

For anyone who encounters the same issue: I had previously trained a rank-16 LoRA and then tried to train a LoRA at a different rank WITH THE SAME MODEL NAME. Because the names matched, the toolkit apparently tried to resume the previous run from its saved checkpoint and failed with the size mismatch. Changing the name of the model resolved the issue.
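To see why resuming fails, here is a minimal sketch (hypothetical helper names, plain Python instead of torch) of how LoRA adapter shapes depend on the configured rank: the down projection is `[rank, in_features]` and the up projection is `[out_features, rank]`, so a rank-16 checkpoint can never satisfy the strict shape check of a freshly built rank-8 network.

```python
def lora_shapes(rank, in_features=3072, out_features=18432):
    """Shapes of the down/up matrices for one adapted layer (sizes from the traceback)."""
    return {
        "lora_down.weight": (rank, in_features),   # projects input -> rank
        "lora_up.weight": (out_features, rank),    # projects rank -> output
    }

def shape_mismatches(checkpoint, model):
    """Mimic the strict shape comparison load_state_dict performs per key."""
    return [
        (key, ckpt_shape, model[key])
        for key, ckpt_shape in checkpoint.items()
        if ckpt_shape != model[key]
    ]

ckpt = lora_shapes(rank=16)   # saved by the earlier rank-16 run
model = lora_shapes(rank=8)   # freshly built rank-8 network
for key, saved, current in shape_mismatches(ckpt, model):
    print(f"size mismatch for {key}: checkpoint {saved}, current model {current}")
```

In other words, the error is expected behavior for a resume with a changed rank; the fix is to start a fresh run (new name) or delete the old checkpoints so nothing is resumed.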

Vyxodez changed discussion status to closed