{ "cells": [ { "cell_type": "markdown", "id": "70300319-d206-43ce-b3bf-3da6b079f20f", "metadata": { "id": "70300319-d206-43ce-b3bf-3da6b079f20f" }, "source": [ "## MusicGen in 🤗 Transformers\n", "\n", "**by [Sanchit Gandhi](https://huggingface.co/sanchit-gandhi)**\n", "\n", "MusicGen is a Transformer-based model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. It was proposed in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet et al. from Meta AI.\n", "\n", "The MusicGen model can be decomposed into three distinct stages:\n", "1. The text descriptions are passed through a frozen text encoder model to obtain a sequence of hidden-state representations\n", "2. The MusicGen decoder is then trained to predict discrete audio tokens, or *audio codes*, conditioned on these hidden states\n", "3. These audio tokens are then decoded using an audio compression model, such as EnCodec, to recover the audio waveform\n", "\n", "The pre-trained MusicGen checkpoints use Google's [t5-base](https://huggingface.co/t5-base) as the text encoder model, and [EnCodec 32kHz](https://huggingface.co/facebook/encodec_32khz) as the audio compression model. The MusicGen decoder is a pure language model architecture,\n", "trained from scratch on the task of music generation.\n", "\n", "The novelty in the MusicGen model is how the audio codes are predicted. Traditionally, each codebook has to be predicted by a separate model (i.e. hierarchically) or by continuously refining the output of the Transformer model (i.e. upsampling). MusicGen uses an efficient *token interleaving pattern*, thus eliminating the need to cascade multiple models to predict a set of codebooks. Instead, it is able to generate the full set of codebooks in a single forward pass of the decoder, resulting in much faster inference.\n", "\n", "
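\n",
"The interleaving idea can be sketched in plain Python. The snippet below implements a \"delay\"-style pattern in which codebook *k* is shifted right by *k* timesteps, so a single decoder step can emit one token per codebook; `PAD` and the helper name `delay_interleave` are illustrative assumptions, not the library's actual implementation or padding id.\n",
"\n",
"```python\n",
"PAD = -1  # hypothetical padding id (illustration only)\n",
"\n",
"def delay_interleave(codes):\n",
"    # codes: list of K codebook sequences, each of length T.\n",
"    # Returns K rows of length T + K - 1, with codebook k offset\n",
"    # by k steps so all codebooks align for one decoder pass.\n",
"    K = len(codes)\n",
"    return [[PAD] * k + row + [PAD] * (K - 1 - k) for k, row in enumerate(codes)]\n",
"\n",
"codes = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]\n",
"for row in delay_interleave(codes):\n",
"    print(row)\n",
"```\n",
"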
\n",
"
\n",
"