| # AudioCraft conditioning modules |
|
|
| AudioCraft provides a |
| [modular implementation of conditioning modules](../audiocraft/modules/conditioners.py) |
| that can be used with the language model to condition the generation. |
The codebase is designed so that the set of supported modules can be easily extended,
making it straightforward to develop new ways of controlling the generation.
|
|
|
|
| ## Conditioning methods |
|
|
For now, we support three main types of conditioning within AudioCraft:
| * Text-based conditioning methods |
| * Waveform-based conditioning methods |
| * Joint embedding conditioning methods for text and audio projected in a shared latent space. |
|
|
The language model relies on two core components that handle the conditioning information:
* The `ConditionProvider` class, which maps metadata to processed conditions, leveraging
all the conditioners defined for the given task.
* The `ConditionFuser` class, which takes preprocessed conditions and fuses the
conditioning embeddings with the language model inputs following a given fusing strategy.
|
|
Different conditioners (for text, waveform, joint embeddings, etc.) are provided as torch
modules in AudioCraft and are used internally by the language model to process the
conditioning signals.
|
|
|
|
| ## Core concepts |
|
|
| ### Conditioners |
|
|
The `BaseConditioner` torch module is the base implementation for all conditioners in AudioCraft.
|
|
Each conditioner is expected to implement two methods (see the sketch after this list):
* The `tokenize` method, a preprocessing step that contains all processing that can lead
to synchronization points (e.g. BPE tokenization with transfer to the GPU).
Its output is then used to feed the `forward` method.
* The `forward` method, which takes the output of `tokenize` and contains the core computation
producing the conditioning embedding, along with a mask indicating the valid positions
(e.g. excluding padding tokens).
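
Below is a minimal, hypothetical sketch of this two-step pattern. It is not part of the
AudioCraft API: real conditioners subclass `BaseConditioner` (which also projects the
embeddings to the model dimension), and the class and attribute names here are purely
illustrative.

```python
import typing as tp

import torch
import torch.nn as nn


class MyLabelConditioner(nn.Module):
    """Illustrative conditioner embedding a small set of categorical labels."""

    def __init__(self, labels: tp.List[str], dim: int):
        super().__init__()
        self.label_to_idx = {label: i for i, label in enumerate(labels)}
        self.embed = nn.Embedding(len(labels) + 1, dim)  # last index = missing label

    def tokenize(self, labels: tp.List[tp.Optional[str]]) -> torch.Tensor:
        # Preprocessing: anything that may introduce a synchronization point
        # (lookups, tokenization, device transfers) happens here.
        idx = [self.label_to_idx.get(label, len(self.label_to_idx)) for label in labels]
        return torch.tensor(idx, device=self.embed.weight.device)

    def forward(self, tokens: torch.Tensor) -> tp.Tuple[torch.Tensor, torch.Tensor]:
        # Core computation: returns a [B, T, dim] embedding and a [B, T] mask
        # of valid positions (here, every position is valid).
        emb = self.embed(tokens).unsqueeze(1)  # [B, 1, dim]
        mask = torch.ones(emb.shape[:2], dtype=torch.long, device=emb.device)
        return emb, mask
```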
|
|
| ### ConditionProvider |
|
|
The `ConditionProvider` prepares and provides conditions given a dictionary of conditioners.
|
|
Conditioners are specified as a dictionary mapping each attribute to the conditioner
that provides the processing logic for that attribute.
|
|
Similarly to the conditioners, the condition provider works in two steps to avoid synchronization points:
* A `tokenize` method that takes a list of conditioning attributes for the batch
and runs the tokenize step of every conditioner.
* A `forward` method that takes the output of the tokenize step and runs the forward step
of every conditioner.
|
|
The conditioning attributes are passed as a list of `ConditioningAttributes` objects,
presented further below.
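
As a hedged usage sketch (exact constructor signatures may differ slightly across
AudioCraft versions), driving the two-step API with a single text conditioner could
look as follows:

```python
from audiocraft.modules.conditioners import (
    ConditioningAttributes,
    ConditionProvider,
    T5Conditioner,
)

provider = ConditionProvider({
    'description': T5Conditioner('t5-base', output_dim=512, finetune=False, device='cpu'),
})
attributes = [ConditioningAttributes(text={'description': 'calm piano over soft rain'})]

tokenized = provider.tokenize(attributes)  # step 1: preprocessing, may hit sync points
conditions = provider(tokenized)           # step 2: {'description': (embedding, mask)}
```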
|
|
| ### ConditionFuser |
|
|
Once all conditioning signals have been extracted and processed by the `ConditionProvider`
as dense embeddings, they still need to be passed to the language model along with its
original inputs.
|
|
The `ConditionFuser` specifically handles the logic of combining the different conditions
with the actual model input, supporting several strategies to do so.
|
|
One can therefore define different strategies to combine or fuse the conditions with the input
(see the configuration sketch after this list), in particular:
* Prepending the conditioning signal to the input with the `prepend` strategy,
* Summing the conditioning signal with the input with the `sum` strategy,
* Combining the conditioning through a cross-attention mechanism with the `cross` strategy,
* Using input interpolation with the `input_interpolate` strategy.
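
As an illustration, the fuser is configured with a mapping from each strategy to the
condition names it applies to. The mapping below mirrors a typical melody-plus-text setup,
but the attribute names are illustrative and the constructor may accept extra arguments:

```python
from audiocraft.modules.conditioners import ConditionFuser

fuser = ConditionFuser(fuse2cond={
    'prepend': ['self_wav'],   # prepend waveform-derived conditions to the sequence
    'cross': ['description'],  # attend to the text embedding via cross-attention
    'sum': [],
    'input_interpolate': [],
})
```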
|
|
| ### SegmentWithAttributes and ConditioningAttributes: From metadata to conditions |
|
|
| The `ConditioningAttributes` dataclass is the base class for metadata |
| containing all attributes used for conditioning the language model. |
|
|
| It currently supports the following types of attributes: |
| * Text conditioning attributes: Dictionary of textual attributes used for text-conditioning. |
| * Wav conditioning attributes: Dictionary of waveform attributes used for waveform-based |
| conditioning such as the chroma conditioning. |
| * JointEmbed conditioning attributes: Dictionary of text and waveform attributes |
| that are expected to be represented in a shared latent space. |
|
|
These attributes are the inputs processed by the corresponding conditioners; a hand-built
example is given below.
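
For instance, a batch item conditioned on both a text description and a reference waveform
could be built as in this hedged sketch (the `WavCondition` field layout follows the
current codebase and may evolve):

```python
import torch

from audiocraft.modules.conditioners import ConditioningAttributes, WavCondition

attributes = ConditioningAttributes(
    text={'description': '80s driving synth pop with heavy drums'},
    wav={'self_wav': WavCondition(
        wav=torch.zeros(1, 1, 32000),  # [B, C, T] reference waveform
        length=torch.tensor([32000]),  # valid length of each item
        sample_rate=[32000],
        path=[None],
    )},
)
```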
|
|
`ConditioningAttributes` are extracted from metadata loaded along with the audio in the datasets,
provided that the metadata used by the dataset implements the `SegmentWithAttributes` abstraction.
|
|
All metadata-enabled datasets used for conditioning in AudioCraft inherit from
the [`audiocraft.data.info_audio_dataset.InfoAudioDataset`](../audiocraft/data/info_audio_dataset.py) class,
and the corresponding metadata inherits from and implements the `SegmentWithAttributes` abstraction.
Refer to the [`audiocraft.data.music_dataset.MusicAudioDataset`](../audiocraft/data/music_dataset.py)
class as an example.
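
Under that abstraction, going from a dataset sample to conditioning attributes is a
one-liner, as in this hedged sketch (assuming the dataset was built with `return_info=True`):

```python
from audiocraft.data.info_audio_dataset import InfoAudioDataset


def attributes_from_sample(dataset: InfoAudioDataset, index: int):
    # With `return_info=True`, the dataset yields (wav, info) pairs where `info`
    # implements SegmentWithAttributes and can convert itself into the
    # ConditioningAttributes consumed by the ConditionProvider.
    wav, info = dataset[index]
    return info.to_condition_attributes()
```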
|
|
|
|
| ## Available conditioners |
|
|
| ### Text conditioners |
|
|
| All text conditioners are expected to inherit from the `TextConditioner` class. |
|
|
AudioCraft currently provides two text conditioners:
* The `LUTConditioner` that relies on a look-up table of embeddings learned at train time,
using either no tokenizer or a spaCy tokenizer. This conditioner is particularly
useful for simple experiments and categorical labels (see the example below).
* The `T5Conditioner` that relies on a
[pre-trained T5 model](https://huggingface.co/docs/transformers/model_doc/t5),
frozen or fine-tuned at train time, to extract the text embeddings.
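
A hedged instantiation example for the look-up-table conditioner (the constructor arguments
shown are indicative and may differ across versions):

```python
from audiocraft.modules.conditioners import LUTConditioner

# A learned embedding table over at most 16 labels, embedded in 64 dimensions and
# projected to a 512-d model dimension; the 'noop' tokenizer uses the labels as-is,
# without tokenization.
genre_conditioner = LUTConditioner(n_bins=16, dim=64, output_dim=512, tokenizer='noop')
```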
|
|
| ### Waveform conditioners |
|
|
All waveform conditioners are expected to inherit from the `WaveformConditioner` class and
take a waveform as input. A waveform conditioner must implement the logic to extract
the embedding from the waveform and define the downsampling factor from the waveform
to the resulting embedding.
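
As a hypothetical sketch of these two responsibilities (the hook names follow the pattern
used in the codebase, but check the `WaveformConditioner` class for exact signatures; the
conditioner itself is invented for illustration):

```python
import torch

from audiocraft.modules.conditioners import WavCondition, WaveformConditioner


class EnergyConditioner(WaveformConditioner):
    """Illustrative conditioner embedding frame-wise RMS energy."""

    def __init__(self, output_dim: int, frame_size: int = 640, device: str = 'cpu'):
        super().__init__(dim=1, output_dim=output_dim, device=device)
        self.frame_size = frame_size

    def _get_wav_embedding(self, x: WavCondition) -> torch.Tensor:
        wav = x.wav.mean(dim=1)  # downmix to mono: [B, T]
        frames = wav.unfold(-1, self.frame_size, self.frame_size)  # [B, T', frame_size]
        return frames.pow(2).mean(dim=-1, keepdim=True).sqrt()  # [B, T', 1]

    def _downsampling_factor(self) -> int:
        # number of waveform samples per embedding step
        return self.frame_size
```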
|
|
The `ChromaStemConditioner` is a waveform conditioner for the chroma features
conditioning used by MusicGen. It takes a given waveform, extracts the stems relevant to the melody
(namely all stems except drums and bass) using a
[pre-trained Demucs model](https://github.com/facebookresearch/demucs),
and then extracts the chromagram bins from the remaining mix of stems.
|
|
| ### Joint embeddings conditioners |
|
|
Finally, we provide support for conditioning based on joint text and audio embeddings through
the `JointEmbeddingConditioner` class, and the `CLAPEmbeddingConditioner` that implements such
a conditioning method relying on a [pre-trained CLAP model](https://github.com/LAION-AI/CLAP).
|
|
| ## Classifier Free Guidance |
|
|
We provide a classifier-free guidance implementation in AudioCraft. With classifier-free
guidance dropout, all attributes are dropped together with the same probability.
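
Dropping all attributes at once at train time is what makes the model usable unconditionally
at inference, where the conditional and unconditional predictions are typically blended with
a guidance coefficient (exposed as `cfg_coef` in the models' generation parameters). A hedged
sketch of that blending:

```python
import torch


def apply_cfg(cond_logits: torch.Tensor, uncond_logits: torch.Tensor,
              cfg_coef: float = 3.0) -> torch.Tensor:
    # cfg_coef = 1.0 recovers the purely conditional logits; larger values
    # push the generation harder towards the conditioning signal.
    return uncond_logits + cfg_coef * (cond_logits - uncond_logits)
```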
|
|
| ## Attribute Dropout |
|
|
We further provide an attribute dropout strategy. Unlike the classifier-free guidance dropout,
the attribute dropout drops given attributes with a defined probability, so the model learns
not to expect all conditioning signals to be provided at once.
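
A hedged configuration sketch (the probabilities and attribute names are illustrative, and
the exact constructor may differ across versions):

```python
from audiocraft.modules.conditioners import AttributeDropout

# Drop the text description 30% of the time and the reference waveform 50%
# of the time, independently of each other.
dropout = AttributeDropout(p={
    'text': {'description': 0.3},
    'wav': {'self_wav': 0.5},
})
# At train time, this would be applied to a list of ConditioningAttributes:
# attributes = dropout(attributes)
```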
|
|
| ## Faster computation of conditions |
|
|
Conditioners that require some heavy computation on the waveform can be cached, in particular
the `ChromaStemConditioner` or the `CLAPEmbeddingConditioner`. You just need to provide the
`cache_path` parameter to them. We recommend running dummy jobs to quickly fill up the cache.
An example is provided in the [musicgen.musicgen_melody_32khz grid](../audiocraft/grids/musicgen/musicgen_melody_32khz.py).
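
As a hedged sketch (the constructor arguments are indicative of the melody setup and may
differ across versions; the relevant part is `cache_path`):

```python
from audiocraft.modules.conditioners import ChromaStemConditioner

chroma = ChromaStemConditioner(
    output_dim=512,
    sample_rate=32000,
    n_chroma=12,
    radix2_exp=12,
    duration=30.0,
    cache_path='/tmp/audiocraft_chroma_cache',  # illustrative cache location
    device='cpu',
)
```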