Training in progress, step 20

fcca8c8 verified 3 months ago

3.93 kB

	---
	title: FAQ
	description: Frequently asked questions
	---

	### General

	Q: The trainer stopped and hasn't progressed in several minutes.

	> A: Usually an issue with the GPUs communicating with each other. See the [NCCL doc](nccl.qmd)

	Q: Exitcode -9

	> A: This usually happens when you run out of system RAM.

	Q: Exitcode -7 while using deepspeed

	> A: Try upgrading deepspeed w: `pip install -U deepspeed`

	Q: AttributeError: 'DummyOptim' object has no attribute 'step'

	Q: ModuleNotFoundError: No module named 'mpi4py' using single GPU with deepspeed

	> A: You may be using deepspeed with single gpu. Please remove the `deepspeed:` section in the yaml file or `--deepspeed` CLI flag.

	Q: The codes is stuck on saving preprocessed datasets.

	> A: This is usually an issue with the GPU. This can be resolved through setting the os environment variable `CUDA_VISIBLE_DEVICES=0`. If you are on runpod, this is usually a pod issue. Starting a new pod should take care of it.

	Q: Received mismatch error on merge adapters / loading adapters between torch.Size of checkpoint and model.

	> A: This is likely due to vocab size mismatch. By default, Axolotl expands the model's embeddings if the tokenizer has more tokens than the model. Please use the `axolotl merge-lora` command to merge the adapters instead of using your own scripts.

	> On the other hand, if the model has more tokens than the tokenizer, Axolotl does not shrink the model's embeddings unless `shrink_embeddings: true` is set in the config.

	Q: How to call Axolotl via custom python scripts?

	> A: Yes, since Axolotl is just Python, please see `src/axolotl/cli/main.py` on how each command is called.

	Q: How to know the value to use for `fsdp_transformer_layer_cls_to_wrap`?

	> A: This is the class name of the transformer layer to wrap with FSDP. For example, for `LlamaForCausalLM`, the value is `LlamaDecoderLayer`. To find this for a specific model, check the model's `PreTrainedModel` definition and look for `_no_split_modules` variable in the `modeling_<model_name>.py` file within `transformers` library.

	### Chat templates

	Q: `jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'content' / 'role' / ____`

	> A: This means that the property mapping for the stated attribute does not exist when building `chat_template` prompt. For example, if `no attribute 'content'`, please check you have added the correct mapping for `content` under `message_property_mappings`.

	Q: `Empty template generated for turn ___`

	> A: The `content` is empty for that turn.

	Q: `Could not find content start/end boundary for turn __`

	> A: The specific turn's start/end could not be detected. Please ensure you have set the `eos_token` following your `chat_template`. Otherwise, this could be a `chat_template` which doesn't use proper boundaries for each turn (like system). On the rare occurrence, make sure your content is not `[[dummy_message]]`. Please let us know about this.

	Q: `Content end boundary is before start boundary for turn ___`

	> A: This is an edge case which should not occur. Please create an Issue if this happens.

	Q: `Content end boundary is the same as start boundary for turn ___. This is likely an empty turn.`

	> A: This is likely an empty turn.

	Q: The EOS/EOT token is incorrectly being masked or not being masked.

	> A: This is because of the mismatch between `tokenizer.eos_token` and EOS/EOT token in template. Please make sure to set `eos_token` under `special_tokens` to the same EOS/EOT token as in template.

	Q: "`chat_template` choice is `tokenizer_default` but tokenizer's `chat_template` is null. Please add a `chat_template` in tokenizer config"

	> A: This is because the tokenizer does not have a chat template. Please add a chat template in the tokenizer config. See [chat_template](dataset-formats/conversation.qmd#chat-template) for more details.