jdp8 committed on Oct 23, 2025

Commit

d02fc34

verified ·

1 Parent(s): fb83646

Initial commit


Files changed (34)

README.md +135 -0
feature_extractor/preprocessor_config.json +22 -0
language_model/config.json +39 -0
language_model/model.safetensors +3 -0
language_model/pytorch_model.bin +3 -0
model_index.json +48 -0
projection_model/config.json +7 -0
projection_model/diffusion_pytorch_model.bin +3 -0
projection_model/diffusion_pytorch_model.safetensors +3 -0
scheduler/scheduler_config.json +19 -0
text_encoder/config.json +35 -0
text_encoder/model.safetensors +3 -0
text_encoder/pytorch_model.bin +3 -0
text_encoder_2/config.json +32 -0
text_encoder_2/model.safetensors +3 -0
text_encoder_2/pytorch_model.bin +3 -0
tokenizer/merges.txt +0 -0
tokenizer/special_tokens_map.json +15 -0
tokenizer/tokenizer.json +0 -0
tokenizer/tokenizer_config.json +20 -0
tokenizer/vocab.json +0 -0
tokenizer_2/special_tokens_map.json +107 -0
tokenizer_2/spiece.model +3 -0
tokenizer_2/tokenizer.json +0 -0
tokenizer_2/tokenizer_config.json +112 -0
unet/config.json +74 -0
unet/diffusion_pytorch_model.bin +3 -0
unet/diffusion_pytorch_model.safetensors +3 -0
vae/config.json +28 -0
vae/diffusion_pytorch_model.bin +3 -0
vae/diffusion_pytorch_model.safetensors +3 -0
vocoder/config.json +50 -0
vocoder/model.safetensors +3 -0
vocoder/pytorch_model.bin +3 -0

README.md ADDED Viewed

	@@ -0,0 +1,135 @@

+---
+license: cc-by-nc-sa-4.0
+---
+# AudioLDM 2
+AudioLDM 2 is a latent text-to-audio diffusion model capable of generating realistic audio samples given any text input.
+It is available in the 🧨 Diffusers library from v0.21.0 onwards.
+# Model Details
+AudioLDM 2 was proposed in the paper [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734) by Haohe Liu et al.
+AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects,
+human speech and music.
+# Checkpoint Details
+This is the original, **base** version of the AudioLDM 2 model, also referred to as **audioldm2-full**.
+There are three official AudioLDM 2 checkpoints. Two of these checkpoints are applicable to the general task of text-to-audio
+generation. The third checkpoint is trained exclusively on text-to-music generation. All checkpoints share the same
+model size for the text encoders and VAE. They differ in the size and depth of the UNet. See the table below for details on
+the three official checkpoints:
+| Checkpoint                                                      | Task          | UNet Model Size | Total Model Size | Training Data (h) |
+|-----------------------------------------------------------------|---------------|-----------------|------------------|-------------------|
+| [audioldm2](https://huggingface.co/cvssp/audioldm2)             | Text-to-audio | 350M            | 1.1B             | 1150k             |
+| [audioldm2-large](https://huggingface.co/cvssp/audioldm2-large) | Text-to-audio | 750M            | 1.5B             | 1150k             |
+| [audioldm2-music](https://huggingface.co/cvssp/audioldm2-music) | Text-to-music | 350M            | 1.1B             | 665k              |
+## Model Sources
+- [**Original Repository**](https://github.com/haoheliu/audioldm2)
+- [**🧨 Diffusers Pipeline**](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2)
+- [**Paper**](https://arxiv.org/abs/2308.05734)
+- [**Demo**](https://huggingface.co/spaces/haoheliu/audioldm2-text2audio-text2music)
+# Usage
+First, install the required packages:
+```
+pip install --upgrade diffusers transformers accelerate
+```
+## Text-to-Audio
+For text-to-audio generation, the [AudioLDM2Pipeline](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2) can be
+used to load pre-trained weights and generate text-conditional audio outputs:
+```python
+from diffusers import AudioLDM2Pipeline
+import torch
+repo_id = "cvssp/audioldm2"
+pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
+pipe = pipe.to("cuda")
+prompt = "The sound of a hammer hitting a wooden surface"
+audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]
+```
+The resulting audio output can be saved as a .wav file:
+```python
+import scipy.io.wavfile  # import the submodule explicitly so it is available on older SciPy versions
+scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
+```
+Or displayed in a Jupyter Notebook / Google Colab:
+```python
+from IPython.display import Audio
+Audio(audio, rate=16000)
+```
+## Tips
+Prompts:
+* Descriptive prompt inputs work best: use adjectives to describe the sound (e.g., "high quality" or "clear") and make the prompt context-specific (e.g., "water stream in a forest" instead of "stream").
+* It's best to use general terms like 'cat' or 'dog' instead of specific names or abstract objects that the model may not be familiar with.
+Inference:
+* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument: higher steps give higher quality audio at the expense of slower inference.
+* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
+When evaluating generated waveforms:
+* The quality of the generated waveforms can vary significantly based on the seed. Try generating with different seeds until you find a satisfactory generation.
+* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. Automatic scoring is performed between the generated waveforms and the prompt text, and the audios are ranked from best to worst accordingly.
+The following example demonstrates how to construct a good audio generation using the aforementioned tips:
+```python
+import scipy.io.wavfile
+import torch
+from diffusers import AudioLDM2Pipeline
+# load the pipeline
+repo_id = "cvssp/audioldm2"
+pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
+pipe = pipe.to("cuda")
+# define the prompts
+prompt = "The sound of a hammer hitting a wooden surface"
+negative_prompt = "Low quality."
+# set the seed
+generator = torch.Generator("cuda").manual_seed(0)
+# run the generation
+audio = pipe(
+    prompt,
+    negative_prompt=negative_prompt,
+    num_inference_steps=200,
+    audio_length_in_s=10.0,
+    num_waveforms_per_prompt=3,
+    generator=generator,
+).audios
+# save the best audio sample (index 0) as a .wav file
+scipy.io.wavfile.write("techno.wav", rate=16000, data=audio[0])
+```
+# Citation
+**BibTeX:**
+```
+@article{liu2023audioldm2,
+  title={"AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining"},
+  author={Haohe Liu and Qiao Tian and Yi Yuan and Xubo Liu and Xinhao Mei and Qiuqiang Kong and Yuping Wang and Wenwu Wang and Yuxuan Wang and Mark D. Plumbley},
+  journal={arXiv preprint arXiv:2308.05734},
+  year={2023}
+}
+```

feature_extractor/preprocessor_config.json ADDED Viewed

	@@ -0,0 +1,22 @@

+{
+  "chunk_length_s": 10,
+  "feature_extractor_type": "ClapFeatureExtractor",
+  "feature_size": 64,
+  "fft_window_size": 1024,
+  "frequency_max": 14000,
+  "frequency_min": 50,
+  "hop_length": 480,
+  "max_length_s": 10,
+  "n_fft": 1024,
+  "nb_frequency_bins": 513,
+  "nb_max_frames": 1000,
+  "nb_max_samples": 480000,
+  "padding": "repeatpad",
+  "padding_side": "right",
+  "padding_value": 0.0,
+  "processor_class": "ClapProcessor",
+  "return_attention_mask": false,
+  "sampling_rate": 48000,
+  "top_db": null,
+  "truncation": "rand_trunc"
+}
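
The numeric fields in this feature-extractor config are not independent: several of them can be derived from the others. The sketch below is a plain-Python sanity check over values copied from the config above. The relationships it checks (samples = rate × seconds, frames = samples / hop, bins = n_fft / 2 + 1) are standard STFT arithmetic; that `ClapFeatureExtractor` derives its fields this way is an assumption, not something this commit states.

```python
# Values copied from feature_extractor/preprocessor_config.json in this commit.
config = {
    "sampling_rate": 48000,
    "max_length_s": 10,
    "hop_length": 480,
    "n_fft": 1024,
    "nb_max_samples": 480000,
    "nb_max_frames": 1000,
    "nb_frequency_bins": 513,
}

# Assumed relationships (standard STFT bookkeeping, not read from library code).
derived = {
    "nb_max_samples": config["sampling_rate"] * config["max_length_s"],
    "nb_max_frames": config["nb_max_samples"] // config["hop_length"],
    "nb_frequency_bins": config["n_fft"] // 2 + 1,
}

# Report whether each stored value matches the value derived from the others.
for key, value in derived.items():
    status = "ok" if config[key] == value else "MISMATCH"
    print(f"{key}: stored={config[key]} derived={value} [{status}]")
```

Note that this extractor works at 48 kHz, while the vocoder config in this commit uses a 16 kHz sampling rate, so the two components operate on differently sampled audio.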

language_model/config.json ADDED Viewed

	@@ -0,0 +1,39 @@

+{
+  "activation_function": "gelu_new",
+  "architectures": [
+    "GPT2Model"
+  ],
+  "attn_pdrop": 0.1,
+  "bos_token_id": 50256,
+  "embd_pdrop": 0.1,
+  "eos_token_id": 50256,
+  "initializer_range": 0.02,
+  "layer_norm_epsilon": 1e-05,
+  "max_new_tokens": 8,
+  "model_type": "gpt2",
+  "n_ctx": 1024,
+  "n_embd": 768,
+  "n_head": 12,
+  "n_inner": null,
+  "n_layer": 12,
+  "n_positions": 1024,
+  "reorder_and_upcast_attn": false,
+  "resid_pdrop": 0.1,
+  "scale_attn_by_inverse_layer_idx": false,
+  "scale_attn_weights": true,
+  "summary_activation": null,
+  "summary_first_dropout": 0.1,
+  "summary_proj_to_labels": true,
+  "summary_type": "cls_index",
+  "summary_use_proj": true,
+  "task_specific_params": {
+    "text-generation": {
+      "do_sample": true,
+      "max_length": 50
+    }
+  },
+  "torch_dtype": "float32",
+  "transformers_version": "4.32.0.dev0",
+  "use_cache": true,
+  "vocab_size": 50257
+}

language_model/model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:925012ada53083b40604540406b53570066b6d218380af45dd426fa531b875fb
+size 497772432
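
The three `+` lines above are a Git LFS pointer, not the weights themselves: the commit stores only a spec version, a SHA-256 object ID and the byte size. As a minimal sketch, the pointer can be parsed with plain Python; `parse_lfs_pointer` is a hypothetical helper written for this illustration, and the parameter estimate assumes 4 bytes per weight (the config lists `float32`) and ignores the safetensors header.

```python
# Pointer text copied from language_model/model.safetensors in this commit.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:925012ada53083b40604540406b53570066b6d218380af45dd426fa531b875fb
size 497772432
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line of a Git LFS pointer into named fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER)

# Rough estimate only: assumes float32 weights and ignores file metadata.
approx_params_m = pointer["size_bytes"] / 4 / 1e6

print(pointer["hash_algo"], pointer["size_bytes"])
print(f"~{approx_params_m:.0f}M parameters")
```

The estimate is in line with a 12-layer, 768-dimensional GPT-2 as described in `language_model/config.json`.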

language_model/pytorch_model.bin ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:80896089a320949684e5150079143ee3061df687124216292da482e3b79ddc64
+size 497803293

model_index.json ADDED Viewed

	@@ -0,0 +1,48 @@

+{
+  "_class_name": "AudioLDM2Pipeline",
+  "_diffusers_version": "0.20.0.dev0",
+  "feature_extractor": [
+    "transformers",
+    "ClapFeatureExtractor"
+  ],
+  "language_model": [
+    "transformers",
+    "GPT2LMHeadModel"
+  ],
+  "projection_model": [
+    "audioldm2",
+    "AudioLDM2ProjectionModel"
+  ],
+  "scheduler": [
+    "diffusers",
+    "DDIMScheduler"
+  ],
+  "text_encoder": [
+    "transformers",
+    "ClapModel"
+  ],
+  "text_encoder_2": [
+    "transformers",
+    "T5EncoderModel"
+  ],
+  "tokenizer": [
+    "transformers",
+    "RobertaTokenizerFast"
+  ],
+  "tokenizer_2": [
+    "transformers",
+    "T5TokenizerFast"
+  ],
+  "unet": [
+    "audioldm2",
+    "AudioLDM2UNet2DConditionModel"
+  ],
+  "vae": [
+    "diffusers",
+    "AutoencoderKL"
+  ],
+  "vocoder": [
+    "transformers",
+    "SpeechT5HifiGan"
+  ]
+}

projection_model/config.json ADDED Viewed

	@@ -0,0 +1,7 @@

+{
+  "_class_name": "AudioLDM2ProjectionModel",
+  "_diffusers_version": "0.20.0.dev0",
+  "langauge_model_dim": 768,
+  "text_encoder_1_dim": 1024,
+  "text_encoder_dim": 512
+}

projection_model/diffusion_pytorch_model.bin ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cfb555ca6f1d76278436c48bafea78b5122b9496434694cb8866c096fb1c6ad0
+size 4739951

projection_model/diffusion_pytorch_model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8d4d8b1233e8193c784ac7c99aed9f76b66312a9ddfe8b1bbad68fe03dd71bde
+size 4737688

scheduler/scheduler_config.json ADDED Viewed

	@@ -0,0 +1,19 @@

+{
+  "_class_name": "DDIMScheduler",
+  "_diffusers_version": "0.20.0.dev0",
+  "beta_end": 0.0195,
+  "beta_schedule": "scaled_linear",
+  "beta_start": 0.0015,
+  "clip_sample": false,
+  "clip_sample_range": 1.0,
+  "dynamic_thresholding_ratio": 0.995,
+  "num_train_timesteps": 1000,
+  "prediction_type": "epsilon",
+  "rescale_betas_zero_snr": false,
+  "sample_max_value": 1.0,
+  "set_alpha_to_one": false,
+  "steps_offset": 1,
+  "thresholding": false,
+  "timestep_spacing": "leading",
+  "trained_betas": null
+}
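
To make the scheduler settings above concrete, the following sketch computes the noise schedule and the inference timesteps they imply. It is written in plain Python from the usual definitions of a "scaled_linear" schedule (linear in the square root of beta) and "leading" timestep spacing; it mirrors the general behaviour of `DDIMScheduler` as I understand it rather than reproducing library code, and the helper names are hypothetical.

```python
# Values copied from scheduler/scheduler_config.json in this commit.
BETA_START = 0.0015
BETA_END = 0.0195
NUM_TRAIN_TIMESTEPS = 1000
STEPS_OFFSET = 1


def scaled_linear_betas(start: float, end: float, n: int) -> list:
    """Interpolate linearly between sqrt(start) and sqrt(end), then square."""
    lo, hi = start ** 0.5, end ** 0.5
    return [(lo + (hi - lo) * i / (n - 1)) ** 2 for i in range(n)]


def leading_timesteps(num_inference_steps: int) -> list:
    """Evenly strided timesteps counted from 0, shifted by the offset, in descending order."""
    step_ratio = NUM_TRAIN_TIMESTEPS // num_inference_steps
    return [
        i * step_ratio + STEPS_OFFSET
        for i in reversed(range(num_inference_steps))
    ]


betas = scaled_linear_betas(BETA_START, BETA_END, NUM_TRAIN_TIMESTEPS)
timesteps = leading_timesteps(200)  # 200 matches num_inference_steps in the README

print(f"beta[0]={betas[0]:.4f} beta[-1]={betas[-1]:.4f}")
print(f"{len(timesteps)} steps, from {timesteps[0]} down to {timesteps[-1]}")
```

With 200 inference steps the stride is 5 training timesteps, so more inference steps mean a finer walk over the same 1000-step schedule.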

text_encoder/config.json ADDED Viewed

	@@ -0,0 +1,35 @@

+{
+  "architectures": [
+    "ClapModel"
+  ],
+  "audio_config": {
+    "depths": [
+      2,
+      2,
+      12,
+      2
+    ],
+    "fusion_num_hidden_layers": 2,
+    "hidden_size": 1024,
+    "model_type": "clap_audio_model",
+    "patch_embeds_hidden_size": 128,
+    "projection_hidden_size": 768
+  },
+  "hidden_size": 768,
+  "initializer_factor": 1.0,
+  "logit_scale_init_value": 14.285714285714285,
+  "model_type": "clap",
+  "num_hidden_layers": 16,
+  "projection_dim": 512,
+  "projection_hidden_act": "relu",
+  "text_config": {
+    "classifier_dropout": null,
+    "fusion_hidden_size": 768,
+    "fusion_num_hidden_layers": 2,
+    "initializer_range": 0.02,
+    "model_type": "clap_text_model",
+    "projection_hidden_size": 768
+  },
+  "torch_dtype": "float64",
+  "transformers_version": "4.32.0.dev0"
+}

text_encoder/model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a4a47b4a637dd58e9edb7b64a06acf37328b7cc3eafb0b8a85df895cc9e45d09
+size 776327432

text_encoder/pytorch_model.bin ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:637b3ff0f7b212cedafb00739521dc49d8f7953f12bfc1f76ff692f108a41ed0
+size 776444665

text_encoder_2/config.json ADDED Viewed

	@@ -0,0 +1,32 @@

+{
+  "architectures": [
+    "T5EncoderModel"
+  ],
+  "classifier_dropout": 0.0,
+  "d_ff": 2816,
+  "d_kv": 64,
+  "d_model": 1024,
+  "decoder_start_token_id": 0,
+  "dense_act_fn": "gelu_new",
+  "dropout_rate": 0.1,
+  "eos_token_id": 1,
+  "feed_forward_proj": "gated-gelu",
+  "initializer_factor": 1.0,
+  "is_encoder_decoder": true,
+  "is_gated_act": true,
+  "layer_norm_epsilon": 1e-06,
+  "model_type": "t5",
+  "n_positions": 512,
+  "num_decoder_layers": 24,
+  "num_heads": 16,
+  "num_layers": 24,
+  "output_past": true,
+  "pad_token_id": 0,
+  "relative_attention_max_distance": 128,
+  "relative_attention_num_buckets": 32,
+  "tie_word_embeddings": false,
+  "torch_dtype": "float32",
+  "transformers_version": "4.32.0.dev0",
+  "use_cache": true,
+  "vocab_size": 32128
+}

text_encoder_2/model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c1d0c8f1c739db9343c12ea4b0e3f2c97a833b3c072c251e91d97b7326fefb4e
+size 1364951064

text_encoder_2/pytorch_model.bin ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8c4be8e23954ef72bd0d623206a46b7e1ab7fa23f530b7b9f691d40785273b27
+size 1364996921

tokenizer/merges.txt ADDED Viewed

The diff for this file is too large to render. See raw diff

tokenizer/special_tokens_map.json ADDED Viewed

	@@ -0,0 +1,15 @@

+{
+  "bos_token": "<s>",
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "unk_token": "<unk>"
+}

tokenizer/tokenizer.json ADDED Viewed

The diff for this file is too large to render. See raw diff

tokenizer/tokenizer_config.json ADDED Viewed

	@@ -0,0 +1,20 @@

+{
+  "add_prefix_space": false,
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "errors": "replace",
+  "mask_token": "<mask>",
+  "max_length": null,
+  "model_max_length": 512,
+  "pad_to_multiple_of": null,
+  "pad_token": "<pad>",
+  "pad_token_type_id": 0,
+  "padding_side": "right",
+  "processor_class": "ClapProcessor",
+  "sep_token": "</s>",
+  "tokenizer_class": "RobertaTokenizer",
+  "trim_offsets": true,
+  "unk_token": "<unk>"
+}

tokenizer/vocab.json ADDED Viewed

The diff for this file is too large to render. See raw diff

tokenizer_2/special_tokens_map.json ADDED Viewed

	@@ -0,0 +1,107 @@

+{
+  "additional_special_tokens": [
+    "<extra_id_0>",
+    "<extra_id_1>",
+    "<extra_id_2>",
+    "<extra_id_3>",
+    "<extra_id_4>",
+    "<extra_id_5>",
+    "<extra_id_6>",
+    "<extra_id_7>",
+    "<extra_id_8>",
+    "<extra_id_9>",
+    "<extra_id_10>",
+    "<extra_id_11>",
+    "<extra_id_12>",
+    "<extra_id_13>",
+    "<extra_id_14>",
+    "<extra_id_15>",
+    "<extra_id_16>",
+    "<extra_id_17>",
+    "<extra_id_18>",
+    "<extra_id_19>",
+    "<extra_id_20>",
+    "<extra_id_21>",
+    "<extra_id_22>",
+    "<extra_id_23>",
+    "<extra_id_24>",
+    "<extra_id_25>",
+    "<extra_id_26>",
+    "<extra_id_27>",
+    "<extra_id_28>",
+    "<extra_id_29>",
+    "<extra_id_30>",
+    "<extra_id_31>",
+    "<extra_id_32>",
+    "<extra_id_33>",
+    "<extra_id_34>",
+    "<extra_id_35>",
+    "<extra_id_36>",
+    "<extra_id_37>",
+    "<extra_id_38>",
+    "<extra_id_39>",
+    "<extra_id_40>",
+    "<extra_id_41>",
+    "<extra_id_42>",
+    "<extra_id_43>",
+    "<extra_id_44>",
+    "<extra_id_45>",
+    "<extra_id_46>",
+    "<extra_id_47>",
+    "<extra_id_48>",
+    "<extra_id_49>",
+    "<extra_id_50>",
+    "<extra_id_51>",
+    "<extra_id_52>",
+    "<extra_id_53>",
+    "<extra_id_54>",
+    "<extra_id_55>",
+    "<extra_id_56>",
+    "<extra_id_57>",
+    "<extra_id_58>",
+    "<extra_id_59>",
+    "<extra_id_60>",
+    "<extra_id_61>",
+    "<extra_id_62>",
+    "<extra_id_63>",
+    "<extra_id_64>",
+    "<extra_id_65>",
+    "<extra_id_66>",
+    "<extra_id_67>",
+    "<extra_id_68>",
+    "<extra_id_69>",
+    "<extra_id_70>",
+    "<extra_id_71>",
+    "<extra_id_72>",
+    "<extra_id_73>",
+    "<extra_id_74>",
+    "<extra_id_75>",
+    "<extra_id_76>",
+    "<extra_id_77>",
+    "<extra_id_78>",
+    "<extra_id_79>",
+    "<extra_id_80>",
+    "<extra_id_81>",
+    "<extra_id_82>",
+    "<extra_id_83>",
+    "<extra_id_84>",
+    "<extra_id_85>",
+    "<extra_id_86>",
+    "<extra_id_87>",
+    "<extra_id_88>",
+    "<extra_id_89>",
+    "<extra_id_90>",
+    "<extra_id_91>",
+    "<extra_id_92>",
+    "<extra_id_93>",
+    "<extra_id_94>",
+    "<extra_id_95>",
+    "<extra_id_96>",
+    "<extra_id_97>",
+    "<extra_id_98>",
+    "<extra_id_99>"
+  ],
+  "eos_token": "</s>",
+  "pad_token": "<pad>",
+  "unk_token": "<unk>"
+}

tokenizer_2/spiece.model ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d60acb128cf7b7f2536e8f38a5b18a05535c9e14c7a355904270e15b0945ea86
+size 791656

tokenizer_2/tokenizer.json ADDED Viewed

The diff for this file is too large to render. See raw diff

tokenizer_2/tokenizer_config.json ADDED Viewed

	@@ -0,0 +1,112 @@

+{
+  "additional_special_tokens": [
+    "<extra_id_0>",
+    "<extra_id_1>",
+    "<extra_id_2>",
+    "<extra_id_3>",
+    "<extra_id_4>",
+    "<extra_id_5>",
+    "<extra_id_6>",
+    "<extra_id_7>",
+    "<extra_id_8>",
+    "<extra_id_9>",
+    "<extra_id_10>",
+    "<extra_id_11>",
+    "<extra_id_12>",
+    "<extra_id_13>",
+    "<extra_id_14>",
+    "<extra_id_15>",
+    "<extra_id_16>",
+    "<extra_id_17>",
+    "<extra_id_18>",
+    "<extra_id_19>",
+    "<extra_id_20>",
+    "<extra_id_21>",
+    "<extra_id_22>",
+    "<extra_id_23>",
+    "<extra_id_24>",
+    "<extra_id_25>",
+    "<extra_id_26>",
+    "<extra_id_27>",
+    "<extra_id_28>",
+    "<extra_id_29>",
+    "<extra_id_30>",
+    "<extra_id_31>",
+    "<extra_id_32>",
+    "<extra_id_33>",
+    "<extra_id_34>",
+    "<extra_id_35>",
+    "<extra_id_36>",
+    "<extra_id_37>",
+    "<extra_id_38>",
+    "<extra_id_39>",
+    "<extra_id_40>",
+    "<extra_id_41>",
+    "<extra_id_42>",
+    "<extra_id_43>",
+    "<extra_id_44>",
+    "<extra_id_45>",
+    "<extra_id_46>",
+    "<extra_id_47>",
+    "<extra_id_48>",
+    "<extra_id_49>",
+    "<extra_id_50>",
+    "<extra_id_51>",
+    "<extra_id_52>",
+    "<extra_id_53>",
+    "<extra_id_54>",
+    "<extra_id_55>",
+    "<extra_id_56>",
+    "<extra_id_57>",
+    "<extra_id_58>",
+    "<extra_id_59>",
+    "<extra_id_60>",
+    "<extra_id_61>",
+    "<extra_id_62>",
+    "<extra_id_63>",
+    "<extra_id_64>",
+    "<extra_id_65>",
+    "<extra_id_66>",
+    "<extra_id_67>",
+    "<extra_id_68>",
+    "<extra_id_69>",
+    "<extra_id_70>",
+    "<extra_id_71>",
+    "<extra_id_72>",
+    "<extra_id_73>",
+    "<extra_id_74>",
+    "<extra_id_75>",
+    "<extra_id_76>",
+    "<extra_id_77>",
+    "<extra_id_78>",
+    "<extra_id_79>",
+    "<extra_id_80>",
+    "<extra_id_81>",
+    "<extra_id_82>",
+    "<extra_id_83>",
+    "<extra_id_84>",
+    "<extra_id_85>",
+    "<extra_id_86>",
+    "<extra_id_87>",
+    "<extra_id_88>",
+    "<extra_id_89>",
+    "<extra_id_90>",
+    "<extra_id_91>",
+    "<extra_id_92>",
+    "<extra_id_93>",
+    "<extra_id_94>",
+    "<extra_id_95>",
+    "<extra_id_96>",
+    "<extra_id_97>",
+    "<extra_id_98>",
+    "<extra_id_99>"
+  ],
+  "clean_up_tokenization_spaces": true,
+  "eos_token": "</s>",
+  "extra_ids": 100,
+  "model_max_length": 128,
+  "pad_token": "<pad>",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "T5Tokenizer",
+  "unk_token": "<unk>"
+}

unet/config.json ADDED Viewed

	@@ -0,0 +1,74 @@

+{
+  "_class_name": "AudioLDM2UNet2DConditionModel",
+  "_diffusers_version": "0.20.0.dev0",
+  "act_fn": "silu",
+  "attention_head_dim": 8,
+  "block_out_channels": [
+    128,
+    256,
+    384,
+    640
+  ],
+  "class_embed_type": null,
+  "class_embeddings_concat": false,
+  "conv_in_kernel": 3,
+  "conv_out_kernel": 3,
+  "cross_attention_dim": [
+    [
+      null,
+      768,
+      1024
+    ],
+    [
+      null,
+      768,
+      1024
+    ],
+    [
+      null,
+      768,
+      1024
+    ],
+    [
+      null,
+      768,
+      1024
+    ]
+  ],
+  "down_block_types": [
+    "DownBlock2D",
+    "CrossAttnDownBlock2D",
+    "CrossAttnDownBlock2D",
+    "CrossAttnDownBlock2D"
+  ],
+  "downsample_padding": 1,
+  "flip_sin_to_cos": true,
+  "freq_shift": 0,
+  "in_channels": 8,
+  "layers_per_block": 2,
+  "mid_block_scale_factor": 1,
+  "mid_block_type": "UNetMidBlock2DCrossAttn",
+  "norm_eps": 1e-05,
+  "norm_num_groups": 32,
+  "num_attention_heads": null,
+  "num_class_embeds": null,
+  "only_cross_attention": false,
+  "out_channels": 8,
+  "projection_class_embeddings_input_dim": null,
+  "resnet_time_scale_shift": "default",
+  "sample_size": 256,
+  "time_cond_proj_dim": null,
+  "time_embedding_act_fn": null,
+  "time_embedding_dim": null,
+  "time_embedding_type": "positional",
+  "timestep_post_act": null,
+  "transformer_layers_per_block": 1,
+  "up_block_types": [
+    "CrossAttnUpBlock2D",
+    "CrossAttnUpBlock2D",
+    "CrossAttnUpBlock2D",
+    "UpBlock2D"
+  ],
+  "upcast_attention": false,
+  "use_linear_projection": false
+}

unet/diffusion_pytorch_model.bin ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9d8d6f8f65e32c7a72aa6c9b7e87debe93e71e5a94669522f3c5ced98b238df9
+size 1388420361

unet/diffusion_pytorch_model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:359a5ffb89a844beb2fcfac584aae2cd7cd6e87c3ab1ec4e892ef45d91db77c2
+size 1387964784

vae/config.json ADDED Viewed

	@@ -0,0 +1,28 @@

+{
+  "_class_name": "AutoencoderKL",
+  "_diffusers_version": "0.20.0.dev0",
+  "act_fn": "silu",
+  "block_out_channels": [
+    128,
+    256,
+    512
+  ],
+  "down_block_types": [
+    "DownEncoderBlock2D",
+    "DownEncoderBlock2D",
+    "DownEncoderBlock2D"
+  ],
+  "force_upcast": true,
+  "in_channels": 1,
+  "latent_channels": 8,
+  "layers_per_block": 2,
+  "norm_num_groups": 32,
+  "out_channels": 1,
+  "sample_size": 1024,
+  "scaling_factor": 0.4110932946205139,
+  "up_block_types": [
+    "UpDecoderBlock2D",
+    "UpDecoderBlock2D",
+    "UpDecoderBlock2D"
+  ]
+}

vae/diffusion_pytorch_model.bin ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b3494aadd9cf3e3f0cbb4e913f9b35a25da4a3cb709852e204b667ae5890f758
+size 221586761

vae/diffusion_pytorch_model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5f8ddddc5c45eddaab38a67a434e8a64486964540ba3fc248a0da7cbd599d4ad
+size 221530308

vocoder/config.json ADDED Viewed

	@@ -0,0 +1,50 @@

+{
+  "architectures": [
+    "SpeechT5HifiGan"
+  ],
+  "initializer_range": 0.01,
+  "leaky_relu_slope": 0.1,
+  "model_in_dim": 64,
+  "model_type": "hifigan",
+  "normalize_before": false,
+  "resblock_dilation_sizes": [
+    [
+      1,
+      3,
+      5
+    ],
+    [
+      1,
+      3,
+      5
+    ],
+    [
+      1,
+      3,
+      5
+    ]
+  ],
+  "resblock_kernel_sizes": [
+    3,
+    7,
+    11
+  ],
+  "sampling_rate": 16000,
+  "torch_dtype": "float32",
+  "transformers_version": "4.32.0.dev0",
+  "upsample_initial_channel": 1024,
+  "upsample_kernel_sizes": [
+    16,
+    16,
+    8,
+    4,
+    4
+  ],
+  "upsample_rates": [
+    5,
+    4,
+    2,
+    2,
+    2
+  ]
+}
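
The vocoder and VAE configs together determine how `audio_length_in_s` from the README maps to tensor sizes. The sketch below works through that arithmetic in plain Python. It assumes that the vocoder emits `prod(upsample_rates)` waveform samples per mel frame, that the VAE downsamples by `2 ** (number of blocks - 1)`, and that the frame count is rounded up to a multiple of that factor; these are assumptions about how the pipeline behaves, and `audio_shapes` is a hypothetical helper.

```python
import math

# Values copied from vocoder/config.json and vae/config.json in this commit.
UPSAMPLE_RATES = [5, 4, 2, 2, 2]
SAMPLING_RATE = 16000
MEL_BINS = 64        # "model_in_dim" in the vocoder config
NUM_VAE_BLOCKS = 3   # len("block_out_channels") in the VAE config

# Waveform samples produced per mel-spectrogram frame (assumed: product of rates).
HOP = math.prod(UPSAMPLE_RATES)

# Assumed spatial downsampling factor of the VAE encoder.
VAE_SCALE = 2 ** (NUM_VAE_BLOCKS - 1)


def audio_shapes(audio_length_in_s: float) -> dict:
    """Estimate waveform, mel and latent sizes for a requested duration."""
    mel_frames = int(audio_length_in_s * SAMPLING_RATE / HOP)
    # Round up so the frame count divides evenly by the VAE factor.
    mel_frames = math.ceil(mel_frames / VAE_SCALE) * VAE_SCALE
    return {
        "waveform_samples": int(audio_length_in_s * SAMPLING_RATE),
        "mel_frames": mel_frames,
        "latent_frames": mel_frames // VAE_SCALE,
        "latent_bins": MEL_BINS // VAE_SCALE,
    }


shapes = audio_shapes(10.0)
print(HOP, VAE_SCALE, shapes)
```

Under these assumptions, the 10-second example in the README corresponds to 160000 output samples at the 16 kHz rate used when writing the .wav file.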

vocoder/model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d9dc6513c30a5b86c2497712690c04fe74b4aa79fdab6d490b34fcb4e24c590c
+size 221079092

vocoder/pytorch_model.bin ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f9fbefc2b31c85d1dabe98e53d09ac88039af411162a7e641040a9c2b5f62364
+size 221120349