UniMus/OpenJMLA

Text Generation

feature-extraction

music foundation model

sino committed on Dec 13, 2023

Commit 9a4a887 · 1 Parent(s): ed14d60

Update README.md

Files changed (1)

README.md +9 -2

@@ -11,8 +11,15 @@ pipeline_tag: text-generation
 </p>
 <br>
-Music tagging is a task to predict the tags of music recordings. However, previous music tagging research primarily focuses on close-set music tagging tasks which can not be generalized to new tags. In this work, we propose a zero-shot music tagging system modeled by a joint music and language attention (**JMLA**) model to address the open-set music tagging problem. The **JMLA** model consists of an audio encoder modeled by a pretrained masked autoencoder and a decoder modeled by a Falcon7B.
-We introduce preceiver resampler to convert arbitrary length audio into fixed length embeddings. We introduce dense attention connections between encoder and decoder layers to improve the information flow between the encoder and decoder layers. We collect a large-scale music and description dataset from the internet. We propose to use ChatGPT to convert the raw descriptions into formalized and diverse descriptions to train the **JMLA** models. Our proposed **JMLA** system achieves a zero-shot audio tagging accuracy of $ 64.82\% $ on the GTZAN dataset, outperforming previous zero-shot systems and achieves comparable results to previous systems on the FMA and the MagnaTagATune datasets.
+Music tagging is the task of predicting tags for music recordings.
+However, previous music tagging research has primarily focused on closed-set tagging tasks, which cannot generalize to new tags.
+In this work, we propose a zero-shot music tagging system modeled by a joint music and language attention (**JMLA**) model to address the open-set music tagging problem.
+The **JMLA** model consists of an audio encoder modeled by a pretrained masked autoencoder and a decoder modeled by Falcon7B.
+We introduce a perceiver resampler to convert arbitrary-length audio into fixed-length embeddings.
+We introduce dense attention connections between encoder and decoder layers to improve the information flow between them.
+We collect a large-scale music and description dataset from the internet.
+We propose using ChatGPT to convert the raw descriptions into formalized and diverse descriptions for training the **JMLA** models.
+Our proposed **JMLA** system achieves a zero-shot audio tagging accuracy of 64.82% on the GTZAN dataset, outperforming previous zero-shot systems, and achieves results comparable to previous systems on the FMA and MagnaTagATune datasets.
 ## Requirements