onnxruntime-node crashes with malloc error during autoregressive generation with Chatterbox Turbo ONNX model

#6
by tkpop777 - opened

Hi, the ONNX model works great in Python with Hugging Face Transformers and the Python build of onnxruntime. But trying to use it from Node.js fails: onnxruntime-node cannot get through repeated inference calls against the Chatterbox Turbo ONNX model.

Environment

Issue

When running autoregressive generation with the language_model.onnx from Chatterbox Turbo, the process crashes with a malloc error after ~12-15 inference iterations:

  malloc: *** error for object 0x...: pointer being freed was not allocated
  Exit code 134

Reproduction

  import * as ort from 'onnxruntime-node';

  const session = await ort.InferenceSession.create('language_model.onnx');

  for (let i = 0; i < 100; i++) {
    const outputs = await session.run({
      inputs_embeds: /* tensor */,
      attention_mask: /* tensor */,
      position_ids: /* tensor */,
      ...pastKeyValues  // empty KV cache tensors
    });
    // Crashes around iteration 12-15
  }
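For iteration 0, the "empty KV cache tensors" spread into the feeds above can be built along these lines. This is a sketch only: the input names (`past_key_values.{i}.key` / `.value`) and the zero-length sequence axis follow the usual Hugging Face ONNX export convention, and `numLayers`/`numHeads`/`headDim` are placeholders, not values read from this model. Plain `{type, data, dims}` objects stand in for `new ort.Tensor(...)` so the sketch runs without onnxruntime-node installed.

```javascript
// Sketch: build empty past-KV feeds for the first autoregressive step.
// Names and shapes assume the common HF ONNX export layout
// [batch, num_heads, past_seq_len, head_dim] with past_seq_len = 0.
function emptyKvFeeds(numLayers, numHeads, headDim, batch = 1) {
  const feeds = {};
  for (let i = 0; i < numLayers; i++) {
    for (const part of ['key', 'value']) {
      feeds[`past_key_values.${i}.${part}`] = {
        type: 'float32',
        data: new Float32Array(0),           // zero elements: past length is 0
        dims: [batch, numHeads, 0, headDim]  // sequence axis is empty
      };
    }
  }
  return feeds;
}
```

With real onnxruntime-node, each placeholder object would instead be `new ort.Tensor('float32', new Float32Array(0), dims)`.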

What I've tried (none worked)

  • Copying all tensor data between iterations
  • Using fresh empty KV cache each iteration (no caching)
  • Disabling enableMemPattern and enableCpuMemArena
  • Using fp32 instead of quantized models
  • Different onnxruntime-node versions (1.14.0, 1.21.0, 1.23.2)
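For reference, "copying all tensor data between iterations" was along these lines: deep-copy the buffer behind each `present.*` output before feeding it back as `past_key_values.*`, so the next `run()` never aliases memory onnxruntime may have released. The `present.{i}` → `past_key_values.{i}` renaming is the usual HF ONNX export convention and an assumption for this model; plain `{type, data, dims}` objects stand in for `ort.Tensor` so the sketch is self-contained.

```javascript
// Sketch of the tensor-copying mitigation: clone present.* outputs into
// fresh typed arrays and rename them to past_key_values.* feed names.
function cloneOutputsToFeeds(outputs) {
  const feeds = {};
  for (const [name, t] of Object.entries(outputs)) {
    if (!name.startsWith('present')) continue; // skip logits etc.
    feeds[name.replace('present', 'past_key_values')] = {
      type: t.type,
      data: t.data.slice(), // fresh typed array, detached from the output
      dims: [...t.dims]
    };
  }
  return feeds;
}
```

Even with every buffer copied this way (and nothing from a previous `run()` reused), the crash still occurred, which points at the native layer rather than JS-side aliasing.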

Notes

  • The same model works perfectly with Python onnxruntime
  • The crash happens in native code, not JavaScript
  • Other ONNX models (speech_encoder, embed_tokens, conditional_decoder) work fine
  • Only the language_model with repeated inference calls crashes
