Device assertion errors with transformers

#1
by hell0ks - opened

Hello,

I tried to run the model with the transformers library, but I'm getting error messages like the ones below:

/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [219,0,0], thread: [192,0,0] Assertion
`ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [219,0,0], thread: [193,0,0] Assertion
`ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [219,0,0], thread: [194,0,0] Assertion
`ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds"` failed.
....
β”‚ /home/user01/transfr-env/lib/python3.12/site-packages/transformers/utils/generic.py:918 in      β”‚
β”‚ wrapper                                                                                          β”‚
β”‚                                                                                                  β”‚
β”‚    915 β”‚   β”‚   return_dict_passed = kwargs.pop("return_dict", return_dict)                       β”‚
β”‚    916 β”‚   β”‚   if return_dict_passed is not None:                                                β”‚
β”‚    917 β”‚   β”‚   β”‚   return_dict = return_dict_passed                                              β”‚
β”‚ ❱  918 β”‚   β”‚   output = func(self, *args, **kwargs)                                              β”‚
β”‚    919 β”‚   β”‚   if not return_dict and not isinstance(output, tuple):                             β”‚
β”‚    920 β”‚   β”‚   β”‚   output = output.to_tuple()                                                    β”‚
β”‚    921 β”‚   β”‚   return output                                                                     β”‚
β”‚                                                                                                  β”‚
β”‚ /home/user01/.cache/huggingface/modules/transformers_modules/tri21b_hyphen_think/modeling_trill β”‚
β”‚ ion.py:556 in forward                                                                            β”‚
β”‚                                                                                                  β”‚
β”‚   553 β”‚   β”‚   if position_ids is None:                                                           β”‚
β”‚   554 β”‚   β”‚   β”‚   position_ids = cache_position.unsqueeze(0)                                     β”‚
β”‚   555 β”‚   β”‚                                                                                      β”‚
β”‚ ❱ 556 β”‚   β”‚   causal_mask = self._update_causal_mask(                                            β”‚
β”‚   557 β”‚   β”‚   β”‚   attention_mask, inputs_embeds, cache_position, past_key_values, output_atten   β”‚
β”‚   558 β”‚   β”‚   )                                                                                  β”‚
β”‚   559                                                                                            β”‚
β”‚                                                                                                  β”‚
β”‚ /home/user01/.cache/huggingface/modules/transformers_modules/tri21b_hyphen_think/modeling_trill β”‚
β”‚ ion.py:641 in _update_causal_mask                                                                β”‚
β”‚                                                                                                  β”‚
β”‚   638 β”‚   β”‚                                                                                      β”‚
β”‚   639 β”‚   β”‚   # When output attentions is True, sdpa implementation's forward method calls the   β”‚
β”‚   640 β”‚   β”‚   if self.config._attn_implementation == "sdpa" and not using_static_cache and not   β”‚
β”‚ ❱ 641 β”‚   β”‚   β”‚   if AttentionMaskConverter._ignore_causal_mask_sdpa(                            β”‚
β”‚   642 β”‚   β”‚   β”‚   β”‚   attention_mask,                                                            β”‚
β”‚   643 β”‚   β”‚   β”‚   β”‚   inputs_embeds=input_tensor,                                                β”‚
β”‚   644 β”‚   β”‚   β”‚   β”‚   past_key_values_length=past_seen_tokens,                                   β”‚
β”‚                                                                                                  β”‚
β”‚ /home/user01/transfr-env/lib/python3.12/site-packages/transformers/modeling_attn_mask_utils.py: β”‚
β”‚ 294 in _ignore_causal_mask_sdpa                                                                  β”‚
β”‚                                                                                                  β”‚
β”‚   291 β”‚   β”‚   elif sliding_window is None or key_value_length < sliding_window:                  β”‚
β”‚   292 β”‚   β”‚   β”‚   if len(attention_mask.shape) == 4:                                             β”‚
β”‚   293 β”‚   β”‚   β”‚   β”‚   return False                                                               β”‚
β”‚ ❱ 294 β”‚   β”‚   β”‚   elif not is_tracing and torch.all(attention_mask == 1):                        β”‚
β”‚   295 β”‚   β”‚   β”‚   β”‚   if query_length == 1 or key_value_length == query_length:                  β”‚
β”‚   296 β”‚   β”‚   β”‚   β”‚   β”‚   # For query_length == 1, causal attention and bi-directional attenti   β”‚
β”‚   297 β”‚   β”‚   β”‚   β”‚   β”‚   ignore_causal_mask = True
...
AcceleratorError: CUDA error: device-side assert triggered
Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
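As the error message suggests, CUDA asserts are reported asynchronously, so the Python traceback above may point at the wrong line. Rerunning with synchronous kernel launches surfaces the assert at the real failing op. A minimal sketch (`your_script.py` is a placeholder for whatever inference script triggers the error):

```shell
# Force synchronous CUDA kernel launches so the device-side assert is raised
# at the Python line that actually launched the failing kernel.
# Debug only -- this serializes all kernels and slows everything down.
CUDA_LAUNCH_BLOCKING=1 python your_script.py
```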

Other models run just fine with the same environment and hardware, so I thought it would be best to open a discussion here.

Added:
It occurs with batch size > 2. Batch size = 1 seems to work just fine.

It seems the vocab size is different for some reason:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Tri-21B-Think")
print(len(tokenizer))

124418

config.json

"vocab_size": 124416

Which one is correct?
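This mismatch would explain the gather assert: if the tokenizer can emit token IDs 124416 and 124417 but the embedding table only has 124416 rows, any batch containing those IDs indexes past the end of the table. A small self-contained sketch of the failure mode (toy sizes stand in for the real 124418-vs-124416 mismatch):

```python
import torch

# Toy stand-in for the model's input embedding: vocab_size rows,
# mirroring "vocab_size": 124416 in config.json (scaled down here).
vocab_size = 8
embedding = torch.nn.Embedding(vocab_size, 4)

# Token IDs within the table look up fine.
ok = embedding(torch.tensor([[0, 3, 7]]))

# An ID >= vocab_size (here 9, like a tokenizer emitting ID 124417
# against a 124416-row table) is out of bounds.
try:
    embedding(torch.tensor([[0, 3, 9]]))
    failed = None
except IndexError as e:
    # On CPU this raises immediately; on CUDA the same lookup fires the
    # asynchronous device-side assert seen in the traceback above.
    failed = type(e).__name__

print(failed)  # IndexError
```

This also fits the batch-size observation: padding a batch can introduce special token IDs that a single unpadded sequence never contains, so batch size 1 appears to work while larger batches hit the out-of-range IDs.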

Hi @hell0ks , thanks for flagging this! I’ve pushed an update so the vocabulary sizes now match. Please let me know if you run into any other issues.

Thanks, batch inference works well after re-downloading the model.

hell0ks changed discussion status to closed
