CUDA out of memory on 30-minute audio

#20
by dominik-machacek - opened

Hi,

I'm getting a CUDA OOM error when processing a 30-minute audio file. The GPU is an NVIDIA L40 with 46068 MiB; it was similar on a 94 GB one.
My usage script is this: https://github.com/sarapapi/hearing2translate/blob/4bbc37e0785ea60a740de3a613c11d0c602698b5/inference/sfm/canaryv2.py#L13

except that I edited this line -- setting batch_size=1 didn't help, and neither did passing [speech] instead of speech:

    transcriptions = model.transcribe([speech], source_lang=src, target_lang=tgt, timestamps=True, batch_size=1)

The error is as follows:

Traceback (most recent call last):
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/infer.py", line 242, in <module>
    infer(args)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/infer.py", line 204, in infer
    output = generate(model, model_input).strip()
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/inference/sfm/canaryv2.py", line 13, in generate
    transcriptions = model.transcribe([speech], source_lang=src, target_lang=tgt, timestsamps=True, batch_size=1)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/nemo/collections/asr/models/aed_multitask_models.py", line 581, in transcribe
    results = super().transcribe(audio=audio, override_config=trcfg)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/nemo/collections/asr/parts/mixins/transcription.py", line 270, in transcribe
    for processed_outputs in generator:
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/nemo/collections/asr/parts/mixins/transcription.py", line 369, in transcribe_generator
    model_outputs = self._transcribe_forward(test_batch, transcribe_cfg)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/nemo/collections/asr/models/aed_multitask_models.py", line 958, in _transcribe_forward
    log_probs, encoded_len, enc_states, enc_mask = self.forward(input_signal=audio, input_signal_length=audio_lens)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/nemo/core/classes/common.py", line 1204, in wrapped_call
    outputs = wrapped(*args, **kwargs)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/nemo/collections/asr/models/aed_multitask_models.py", line 743, in forward
    encoded, encoded_len = self.encoder(audio_signal=processed_signal, length=processed_signal_length)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/nemo/core/classes/common.py", line 1204, in wrapped_call
    outputs = wrapped(*args, **kwargs)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/nemo/collections/asr/modules/conformer_encoder.py", line 584, in forward
    return self.forward_internal(
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/nemo/collections/asr/modules/conformer_encoder.py", line 683, in forward_internal
    audio_signal = layer(
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/conformer_modules.py", line 181, in forward
    x = self.self_attn(query=x, key=x, value=x, mask=att_mask, pos_emb=pos_emb, cache=cache_last_channel)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py", line 314, in forward
    matrix_bd = self.rel_shift(matrix_bd)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/multi_head_attention.py", line 266, in rel_shift
    x = torch.nn.functional.pad(x, pad=(1, 0))  # (b, h, t1, t2+1)
  File "/lnet/work/people/machacek/uedin/systems/hearing2translate/p3/lib/python3.10/site-packages/torch/nn/functional.py", line 5294, in pad
    return torch._C._nn.pad(input, pad, mode, value)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.14 GiB. GPU 0 has a total capacity of 44.32 GiB of which 2.64 GiB is free. Process 224045 has 41.67 GiB memory in use. Of the allocated memory 41.11 GiB is allocated by PyTorch, and 59.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[W121 17:37:12.219515031 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator())
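For context, an allocation of this size is what you'd expect from the quadratic memory cost of full self-attention in the Conformer encoder: the `rel_shift` pad in the traceback materializes a score tensor of shape `(batch, heads, t, t + 1)` over all encoder frames at once. A rough back-of-envelope (a sketch only; the 10 ms feature hop, 8x subsampling, 16 heads, and fp32 scores are assumptions about the model config, not facts taken from the error):

```python
# Back-of-envelope for the relative-position attention score tensor
# allocated in rel_shift: shape (batch=1, heads, t, t + 1) in fp32.
audio_secs = 30 * 60      # 30-minute input
frame_hop_secs = 0.01     # assumed 10 ms feature hop
subsampling = 8           # assumed 8x encoder subsampling
heads = 16                # assumed number of attention heads
bytes_per_el = 4          # fp32

t = int(audio_secs / frame_hop_secs / subsampling)  # encoder frames
score_bytes = heads * t * (t + 1) * bytes_per_el
print(f"t = {t} frames, score tensor ~ {score_bytes / 2**30:.1f} GiB")
# → t = 22500 frames, score tensor ~ 30.2 GiB
```

Under these assumptions a 30-minute input gives a ~30 GiB score tensor, the same order as the 32.14 GiB in the error. Since this memory grows with the square of the audio length, even a 94 GB GPU would only raise the manageable duration by roughly 1.4x; splitting the audio scales much better.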

The weird thing is that the same audio works through the demo: https://huggingface.co/spaces/nvidia/canary-1b-v2
I just found out that the demo runs on an H200 GPU, which has 141 GB of memory. Is that really needed? Is there another way to process long audio on a small GPU?
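Short of a bigger GPU, the usual workaround is to split the waveform into fixed-length chunks and transcribe them one at a time, so the encoder never sees the full 30 minutes at once. A minimal sketch, assuming `speech` is a 1-D 16 kHz numpy array as in the script above (the `chunk_audio` helper and the 30 s / 2 s values are my own illustration, not a NeMo API):

```python
import numpy as np

def chunk_audio(speech, sr=16000, chunk_secs=30.0, overlap_secs=2.0):
    """Split a 1-D waveform into fixed-length, optionally overlapping chunks."""
    chunk = int(chunk_secs * sr)
    step = chunk - int(overlap_secs * sr)
    return [speech[start:start + chunk] for start in range(0, len(speech), step)]

# Transcribe chunk by chunk and join the pieces (untested sketch):
# texts = []
# for c in chunk_audio(speech):
#     out = model.transcribe([c], source_lang=src, target_lang=tgt)
#     texts.append(out[0].text)  # assumes transcribe returns hypotheses with .text
# full_text = " ".join(texts)
```

With overlapping chunks the transcripts repeat a few words at each boundary, so the joined text needs light merging. NeMo also ships dedicated chunked-inference scripts for long-form audio that handle this, which may be worth checking.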

Thanks!
