specifying `init_audio` when calling `generate_diffusion_cond` doesn't work as expected

#47

by jps-la - opened Apr 8, 2025

edited Apr 9, 2025

Hi. I'm using the example Python code from the "model card" page. I load an input .wav into a Tensor and pass it, together with its sample rate, to `generate_diffusion_cond()` like this:

...
    inputInfo = torchaudio.info("./input.wav") # AudioMetaData(sample_rate=44100, num_frames=2097152, num_channels=2, bits_per_sample=16, encoding=PCM_S)

    (inputTensor, inputSampleRate) = torchaudio.load("./input.wav")
    inputTuple = (inputSampleRate, inputTensor)
...
    output = generate_diffusion_cond(
        model,
        steps=100,
        cfg_scale=7,
        conditioning=conditioning,
        sample_size=sample_size,
        init_audio=inputTuple,
        sigma_min=0.3,
        sigma_max=500,
        sampler_type="dpmpp-3m-sde",
        device=device
    )

Basically, the generated output.wav is identical to input.wav, as though the conditioning prompt were completely ignored. I was expecting the input file to be transformed by the conditioning prompt into something different (for example, changing rock music into piano, or whatever).
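To put a number on "identical" rather than judging by ear, a check along these lines could be used. It is only a minimal sketch: it assumes both files are 16-bit PCM (as `torchaudio.info` reports for input.wav) and have the same channel layout, and `max_abs_sample_diff` is a hypothetical helper name, not part of any library.

```python
import array
import sys
import wave


def max_abs_sample_diff(path_a, path_b):
    """Largest absolute difference between corresponding samples of two
    16-bit PCM WAV files (hypothetical helper; assumes matching channel layout)."""

    def read_samples(path):
        with wave.open(path, "rb") as wav:
            if wav.getsampwidth() != 2:
                raise ValueError(f"{path}: expected 16-bit PCM samples")
            samples = array.array("h")
            samples.frombytes(wav.readframes(wav.getnframes()))
        if sys.byteorder == "big":
            samples.byteswap()  # WAV sample data is stored little-endian
        return samples

    a = read_samples(path_a)
    b = read_samples(path_b)
    # zip() stops at the shorter file, so only the overlapping part is compared.
    return max((abs(x - y) for x, y in zip(a, b)), default=0)
```

Calling `max_abs_sample_diff("./input.wav", "output.wav")` would return 0 only if the overlapping samples match exactly; a small nonzero value would suggest the output is a near copy rather than a bit-identical one.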

The forking behavior seems to rule out using the Python debugger. Did I misunderstand something?
