Model output returns multiple speaker_id every fraction of a second

#13
by doggydogger - opened

Hey guys, I'm working on a Vietnamese diarization model. I split large audio inputs into 1m20s chunks (trying to keep the number of speakers in each chunk < 4) and send them to the Triton server for inference. While testing the model I noticed that certain outputs were odd, and I was wondering if anyone has encountered this issue before.
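For reference, the chunking step is roughly the following. This is a minimal sketch, not the exact pipeline code: `chunk_bounds` is a hypothetical helper name, and the 16 kHz sample rate is an assumption about the input audio rather than something stated by the repo.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=80.0):
    """Return (start, end) sample indices for consecutive fixed-length chunks.

    sample_rate=16000 is an assumed default; 80.0 s matches the 1m20s chunks
    described in the post. The last chunk may be shorter than the others.
    """
    chunk_len = int(round(chunk_seconds * sample_rate))
    if chunk_len <= 0:
        raise ValueError("chunk_seconds * sample_rate must be positive")
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


# Example: a 200 s recording at 16 kHz gives two full chunks and one short one.
bounds = chunk_bounds(200 * 16000)
```

Each `(start, end)` pair can then be used to slice the waveform before sending it for inference.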

Normal output: "Diarization Segments: [{'start_time': 0.0, 'end_time': 8.64, 'speaker_id': 'speaker_0'}, {'start_time': 9.2, 'end_time': 24.16, 'speaker_id': 'speaker_1'}, {'start_time': 24.56, 'end_time': 41.68, 'speaker_id': 'speaker_0'}, {'start_time': 43.28, 'end_time': 57.2, 'speaker_id': 'speaker_2'}, {'start_time': 58.64, 'end_time': 79.99, 'speaker_id': 'speaker_3'}]"

Weird output: "Diarization Segments: [{'start_time': 0.0, 'end_time': 1.44, 'speaker_id': 'speaker_0'}, {'start_time': 2.48, 'end_time': 3.12, 'speaker_id': 'speaker_0'}, {'start_time': 3.36, 'end_time': 4.16, 'speaker_id': 'speaker_0'}, {'start_time': 4.72, 'end_time': 7.12, 'speaker_id': 'speaker_0'}, {'start_time': 7.52, 'end_time': 9.52, 'speaker_id': 'speaker_0'}, {'start_time': 10.24, 'end_time': 11.28, 'speaker_id': 'speaker_0'}, {'start_time': 11.36, 'end_time': 12.16, 'speaker_id': 'speaker_0'}, {'start_time': 12.4, 'end_time': 14.24, 'speaker_id': 'speaker_0'}, {'start_time': 15.12, 'end_time': 17.76, 'speaker_id': 'speaker_1'}, {'start_time': 18.24, 'end_time': 20.64, 'speaker_id': 'speaker_1'}, {'start_time': 20.88, 'end_time': 21.68, 'speaker_id': 'speaker_1'}, {'start_time': 22.0, 'end_time': 23.68, 'speaker_id': 'speaker_1'}, {'start_time': 24.08, 'end_time': 27.04, 'speaker_id': 'speaker_1'}, {'start_time': 27.36, 'end_time': 28.72, 'speaker_id': 'speaker_1'}, {'start_time': 29.28, 'end_time': 30.48, 'speaker_id': 'speaker_1'}, {'start_time': 30.96, 'end_time': 32.96, 'speaker_id': 'speaker_1'}, {'start_time': 33.36, 'end_time': 34.88, 'speaker_id': 'speaker_1'}, {'start_time': 36.0, 'end_time': 37.6, 'speaker_id': 'speaker_0'}, {'start_time': 38.16, 'end_time': 40.56, 'speaker_id': 'speaker_0'}, {'start_time': 41.12, 'end_time': 43.28, 'speaker_id': 'speaker_0'}, {'start_time': 44.24, 'end_time': 45.6, 'speaker_id': 'speaker_0'}, {'start_time': 46.8, 'end_time': 47.84, 'speaker_id': 'speaker_0'}, {'start_time': 48.16, 'end_time': 50.0, 'speaker_id': 'speaker_0'}, {'start_time': 50.16, 'end_time': 50.32, 'speaker_id': 'speaker_0'}, {'start_time': 50.4, 'end_time': 52.64, 'speaker_id': 'speaker_0'}, {'start_time': 53.52, 'end_time': 58.4, 'speaker_id': 'speaker_0'}, {'start_time': 58.56, 'end_time': 59.28, 'speaker_id': 'speaker_0'}, {'start_time': 59.76, 'end_time': 60.32, 'speaker_id': 'speaker_0'}, {'start_time': 60.56, 'end_time': 61.6, 'speaker_id': 'speaker_0'}, {'start_time': 62.16, 'end_time': 62.64, 'speaker_id': 'speaker_1'}, {'start_time': 62.8, 'end_time': 62.88, 'speaker_id': 'speaker_1'}, {'start_time': 63.2, 'end_time': 63.28, 'speaker_id': 'speaker_1'}, {'start_time': 65.36, 'end_time': 65.44, 'speaker_id': 'speaker_1'}, {'start_time': 66.32, 'end_time': 66.48, 'speaker_id': 'speaker_1'}, {'start_time': 69.44, 'end_time': 69.52, 'speaker_id': 'speaker_1'}, {'start_time': 69.84, 'end_time': 70.16, 'speaker_id': 'speaker_1'}, {'start_time': 70.48, 'end_time': 71.12, 'speaker_id': 'speaker_1'}, {'start_time': 71.28, 'end_time': 71.52, 'speaker_id': 'speaker_1'}, {'start_time': 72.32, 'end_time': 72.88, 'speaker_id': 'speaker_1'}, {'start_time': 73.04, 'end_time': 73.76, 'speaker_id': 'speaker_1'}, {'start_time': 73.92, 'end_time': 74.24, 'speaker_id': 'speaker_1'}, {'start_time': 75.2, 'end_time': 76.32, 'speaker_id': 'speaker_1'}, {'start_time': 76.56, 'end_time': 76.64, 'speaker_id': 'speaker_1'}, {'start_time': 77.28, 'end_time': 77.36, 'speaker_id': 'speaker_1'}, {'start_time': 77.44, 'end_time': 77.6, 'speaker_id': 'speaker_1'}, {'start_time': 77.68, 'end_time': 77.84, 'speaker_id': 'speaker_1'}, {'start_time': 78.32, 'end_time': 79.12, 'speaker_id': 'speaker_1'}, {'start_time': 79.28, 'end_time': 79.76, 'speaker_id': 'speaker_1'}, {'start_time': 79.84, 'end_time': 79.99, 'speaker_id': 'speaker_1'}]"
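One possible stopgap for the fragmentation is to merge consecutive segments from the same speaker when the gap between them is small. The sketch below assumes the segment dictionaries have exactly the keys shown in the output above. `merge_segments` and the `max_gap` threshold are hypothetical names and values chosen for illustration, not part of the model or the repo. It only smooths the fragmentation; it cannot recover a speaker the model never assigned.

```python
def merge_segments(segments, max_gap=1.5):
    """Merge neighbouring segments of the same speaker separated by <= max_gap seconds.

    max_gap=1.5 is an arbitrary illustrative threshold, not a recommended value.
    The input list is not modified; merged segments are new dictionaries.
    """
    merged = []
    for seg in sorted(segments, key=lambda s: s["start_time"]):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["speaker_id"] == seg["speaker_id"]
            and seg["start_time"] - prev["end_time"] <= max_gap
        ):
            # Same speaker and a short pause: extend the previous segment.
            prev["end_time"] = max(prev["end_time"], seg["end_time"])
        else:
            merged.append(dict(seg))
    return merged


# A few segments copied from the weird output above.
sample = [
    {"start_time": 0.0, "end_time": 1.44, "speaker_id": "speaker_0"},
    {"start_time": 2.48, "end_time": 3.12, "speaker_id": "speaker_0"},
    {"start_time": 3.36, "end_time": 4.16, "speaker_id": "speaker_0"},
    {"start_time": 15.12, "end_time": 17.76, "speaker_id": "speaker_1"},
]
merged_sample = merge_segments(sample)
```

With this threshold the three `speaker_0` fragments collapse into one segment from 0.0 to 4.16, while the `speaker_1` segment stays separate.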

I've uploaded the chunk that gave this messed-up output (it contains 3 speakers). I'm not sure what triggers the problem; my other chunks seem fine. I'm following the repo's instructions for input setup. Please let me know what else I can provide.

Hi, this model is outdated. Can you try the newer one, https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1?

From what I understand, v2.1 only employs the speaker cache for streaming cases; the base architecture remains the same ("streaming Sortformer follows the architecture of the offline version of Sortformer."), so I'm not sure there will be any difference. Either way, the behavior of the base Sortformer model is a bit weird in these examples.

v2.1 is much more robust, as it was trained on more diverse data.
Also, v1 may have degradation on long recordings, whereas v2.1 doesn't have such issues.
