dkounadis/artificial-styletts2


Dionyssos committed on Feb 24, 2025

Commit 8a9a2fe · 1 Parent(s): 376e6a0

offset

Files changed (1)

audiocraft/lm.py +1 -1

audiocraft/lm.py CHANGED

@@ -143,7 +143,7 @@ class LMModel(nn.Module):
             next_token = self.forward(out_codes[:, 0, [0, 1, 2, 3], torch.tensor([3, 2, 1, 0]) + offset][:, :, None],  # index diagonal & exapnd to [bs, n_q, dur=1]
                                       #gen_sequence[:, 0, :, offset-1:offset],  # DIAGINDEXING for setting prediction of lm into gen_sequence THE GENSEQUENCE has to be un-delayed in the end [Because it has to be de-delayed for the vocoder then is actually only the lm input that requires to see the delay thus we could just feed by diaggather] so it matches gen_codes -1 a[[0, 1, 2, 3], torch.tensor([0, 1, 2, 3]) + 5]  the gen_sequence is indexed by vertical column and fed to lm however the prediction of lm is place diagonally with delay to the gen_sequence
                                       condition_tensors=text_condition,  # utilisation of the attention mask of txt condition ?
-                                      token_count=offset-1)  # [bs, 4, 1, 2048]

+                                      token_count=offset)  # [bs, 4, 1, 2048]
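
The context lines of this hunk feed the model one token per codebook, read along a diagonal of `out_codes` (codebook 0 at time `offset + 3`, down to codebook 3 at time `offset`). A minimal sketch of that gather, using plain Python lists instead of torch tensors and ignoring the batch dimension: the names `gather_diagonal`, `codes` and `delays` are hypothetical and are not part of `audiocraft/lm.py`. It does not model `token_count`, whose meaning is not stated in this diff.

```python
def gather_diagonal(codes, offset, delays=(3, 2, 1, 0)):
    """Pick one token per codebook along a diagonal.

    codes[k][t] is the token of codebook k at time step t (hypothetical layout).
    Codebook k is read at time delays[k] + offset, mirroring the index
    pair ([0, 1, 2, 3], [3, 2, 1, 0] + offset) in the diff's context line.
    """
    return [codes[k][d + offset] for k, d in enumerate(delays)]


# Toy table: the token value encodes (codebook, time) as 10 * k + t.
n_q, steps = 4, 8
codes = [[10 * k + t for t in range(steps)] for k in range(n_q)]

print(gather_diagonal(codes, offset=0))  # [3, 12, 21, 30]
print(gather_diagonal(codes, offset=1))  # [4, 13, 22, 31]
```

Each increment of `offset` shifts the whole diagonal one time step to the right, so the gathered column always holds four tokens from four different time steps.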