Cleanup: remove 72 unneeded files (255GB) - duplicates, old models, DiffRhythm, Infinity
This view is limited to 50 files because it contains too many changes.
- TTS/DiffRhythm/MuQ-MuLan-large/README.md +0 -111
- TTS/DiffRhythm/MuQ-MuLan-large/config.json +0 -41
- TTS/DiffRhythm/MuQ-MuLan-large/pytorch_model.bin +0 -3
- TTS/DiffRhythm/MuQ-large-msd-iter/README.md +0 -113
- TTS/DiffRhythm/MuQ-large-msd-iter/config.json +0 -143
- TTS/DiffRhythm/MuQ-large-msd-iter/model.safetensors +0 -3
- TTS/DiffRhythm/MuQ-large-msd-iter/pytorch_model.bin +0 -3
- TTS/DiffRhythm/cfm_model_v1_2.pt +0 -3
- TTS/DiffRhythm/config.json +0 -13
- TTS/DiffRhythm/vae_model.pt +0 -3
- TTS/DiffRhythm/xlm-roberta-base/README.md +0 -200
- TTS/DiffRhythm/xlm-roberta-base/config.json +0 -25
- TTS/DiffRhythm/xlm-roberta-base/flax_model.msgpack +0 -3
- TTS/DiffRhythm/xlm-roberta-base/model.onnx +0 -3
- TTS/DiffRhythm/xlm-roberta-base/model.safetensors +0 -3
- TTS/DiffRhythm/xlm-roberta-base/pytorch_model.bin +0 -3
- TTS/DiffRhythm/xlm-roberta-base/sentencepiece.bpe.model +0 -3
- TTS/DiffRhythm/xlm-roberta-base/tf_model.h5 +0 -3
- TTS/DiffRhythm/xlm-roberta-base/tokenizer.json +0 -0
- TTS/DiffRhythm/xlm-roberta-base/tokenizer_config.json +0 -1
- ace_step/README.md +0 -122
- ace_step/config.json +0 -35
- audio/MelBandRoformer_fp16.safetensors +0 -3
- diffusion_models/Phantom-Wan-14B_fp8_e4m3fn.safetensors +0 -3
- diffusion_models/Wan2_1-I2V-14B-480p_fp8_e4m3fn_scaled_KJ.safetensors +0 -3
- diffusion_models/Wan2_1-InfiniteTalk-Multi_fp8_e4m3fn_scaled_KJ.safetensors +0 -3
- diffusion_models/Wan2_1-InfiniteTalk-Single_fp8_e4m3fn_scaled_KJ.safetensors +0 -3
- diffusion_models/Wan2_1-InfiniteTalk_Multi_Q8.gguf +0 -3
- diffusion_models/Wan2_1-InfiniteTalk_Single_Q8.gguf +0 -3
- diffusion_models/wan2.1-i2v-14b-480p-Q4_K_M.gguf +0 -3
- loras/FastWan_T2V_14B_480p_lora_rank_128_bf16.safetensors +0 -3
- loras/Wan2.2-Fun-A14B-InP-LOW-HPS2.1_resized_dynamic_avg_rank_15_bf16.safetensors +0 -3
- loras/Wan21_PusaV1_LoRA_14B_rank512_bf16.safetensors +0 -3
- misc/TTS/ACE-Step-v1-3.5B/ace_step_transformer/diffusion_pytorch_model.safetensors +0 -3
- misc/TTS/ACE-Step-v1-3.5B/music_dcae_f8c8/diffusion_pytorch_model.safetensors +0 -3
- misc/TTS/ACE-Step-v1-3.5B/music_vocoder/diffusion_pytorch_model.safetensors +0 -3
- misc/TTS/ACE-Step-v1-3.5B/umt5-base/model.safetensors +0 -3
- misc/ace_step/all_in_one/ace_step_v1_3.5b.safetensors +0 -3
- misc/clip_vision/clip_vision_h.safetensors +0 -3
- misc/diffusion_models/MelBandRoformer_fp16.safetensors +0 -3
- misc/diffusion_models/Wan14BI2VFusioniX_phantom_14B_fp16.safetensors +0 -3
- misc/diffusion_models/Wan2_1-Fun-V1_1-14B-Control-Camera_fp8_e4m3fn.safetensors +0 -3
- misc/diffusion_models/Wan2_1-InfiniteTalk_Multi_Q8.gguf +0 -3
- misc/diffusion_models/Wan2_1-InfiniteTalk_Single_Q8.gguf +0 -3
- misc/diffusion_models/Wan2_1-T2V-14B_fp8_e4m3fn_scaled_KJ.safetensors +0 -3
- misc/diffusion_models/Wan2_2-I2V-A14B-HIGH_fp8_e4m3fn_scaled_KJ.safetensors +0 -3
- misc/diffusion_models/Wan2_2-I2V-A14B-LOW_fp8_e4m3fn_scaled_KJ.safetensors +0 -3
- misc/diffusion_models/wan2.1-i2v-14b-480p-Q4_K_M.gguf +0 -3
- misc/diffusion_models/wan2.2_fun_camera_high_noise_14B_fp8_scaled.safetensors +0 -3
- misc/diffusion_models/wan2.2_fun_camera_low_noise_14B_fp8_scaled.safetensors +0 -3
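The summary above lists each file as `- <path> +<added> -<deleted>`. As a rough illustration of how such a listing could be tallied, here is a minimal Python sketch. The function names `parse_summary` and `deleted_by_top_dir` are hypothetical, and the regular expression assumes paths contain no spaces. For Git LFS files the `-3` counts the three lines of the pointer file, not the size of the model itself.

```python
import re
from collections import Counter

# One summary row looks like: "- TTS/DiffRhythm/config.json +0 -13"
SUMMARY_LINE = re.compile(r"^- (?P<path>\S+) \+(?P<added>\d+) -(?P<deleted>\d+)$")


def parse_summary(lines):
    """Return (path, added, deleted) tuples for every row that matches."""
    entries = []
    for line in lines:
        match = SUMMARY_LINE.match(line.strip())
        if match:
            entries.append(
                (match["path"], int(match["added"]), int(match["deleted"]))
            )
    return entries


def deleted_by_top_dir(entries):
    """Sum deleted line counts per top-level directory."""
    totals = Counter()
    for path, _added, deleted in entries:
        totals[path.split("/", 1)[0]] += deleted
    return dict(totals)


# Three rows copied from the listing above.
sample = [
    "- TTS/DiffRhythm/config.json +0 -13",
    "- TTS/DiffRhythm/vae_model.pt +0 -3",
    "- ace_step/config.json +0 -35",
]
entries = parse_summary(sample)
print(deleted_by_top_dir(entries))
```

Grouping by the first path component is only one possible choice; grouping by file extension would separate pointer files from text files instead.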
TTS/DiffRhythm/MuQ-MuLan-large/README.md
DELETED
@@ -1,111 +0,0 @@
----
-license: cc-by-nc-4.0
-language:
-- en
-- zh
-pipeline_tag: audio-classification
-tags:
-- music
----
-
-# MuQ & MuQ-MuLan
-
-<div>
-<a href='#'><img alt="Static Badge" src="https://img.shields.io/badge/Python-3.8%2B-blue?logo=python&logoColor=white"></a>
-<a href='https://arxiv.org/abs/2501.01108'><img alt="Static Badge" src="https://img.shields.io/badge/arXiv-2501.01108-%23b31b1b?logo=arxiv&link=https%3A%2F%2Farxiv.org%2F"></a>
-<a href='https://huggingface.co/OpenMuQ'><img alt="Static Badge" src="https://img.shields.io/badge/huggingface-OpenMuQ-%23FFD21E?logo=huggingface&link=https%3A%2F%2Fhuggingface.co%2FOpenMuQ"></a>
-<a href='https://pytorch.org/'><img alt="Static Badge" src="https://img.shields.io/badge/framework-PyTorch-%23EE4C2C?logo=pytorch"></a>
-<a href='https://pypi.org/project/muq'><img alt="Static Badge" src="https://img.shields.io/badge/pip%20install-muq-green?logo=PyPI&logoColor=white&link=https%3A%2F%2Fpypi.org%2Fproject%2Fmuq"></a>
-</div>
-
-
-This is the official repository for the paper *"**MuQ**: Self-Supervised **Mu**sic Representation Learning
-with Mel Residual Vector **Q**uantization"*. For more detailed information, we strongly recommend referring to https://github.com/tencent-ailab/MuQ and the [paper]((https://arxiv.org/abs/2501.01108)).
-
-In this repo, the following models are released:
-
-- **MuQ**(see [this link](https://huggingface.co/OpenMuQ/MuQ-large-msd-iter)): A large music foundation model pre-trained via Self-Supervised Learning (SSL), achieving SOTA in various MIR tasks.
-- **MuQ-MuLan**(see [this link](https://huggingface.co/OpenMuQ/MuQ-MuLan-large)): A music-text joint embedding model trained via contrastive learning, supporting both English and Chinese texts.
-
-
-## Usage
-
-To begin with, please use pip to install the official `muq` lib, and ensure that your `python>=3.8`:
-```bash
-pip3 install muq
-```
-
-
-Using **MuQ-MuLan** to extract the music and text embeddings and calculate the similarity:
-```python
-import torch, librosa
-from muq import MuQMuLan
-
-# This will automatically fetch checkpoints from huggingface
-device = 'cuda'
-mulan = MuQMuLan.from_pretrained("OpenMuQ/MuQ-MuLan-large")
-mulan = mulan.to(device).eval()
-
-# Extract music embeddings
-wav, sr = librosa.load("path/to/music_audio.wav", sr = 24000)
-wavs = torch.tensor(wav).unsqueeze(0).to(device)
-with torch.no_grad():
-    audio_embeds = mulan(wavs = wavs)
-
-# Extract text embeddings (texts can be in English or Chinese)
-texts = ["classical genres, hopeful mood, piano.", "一首适合海边风景的小提琴曲,节奏欢快"]
-with torch.no_grad():
-    text_embeds = mulan(texts = texts)
-
-# Calculate dot product similarity
-sim = mulan.calc_similarity(audio_embeds, text_embeds)
-print(sim)
-```
-
-
-To extract music audio features using **MuQ**:
-```python
-import torch, librosa
-from muq import MuQ
-
-device = 'cuda'
-wav, sr = librosa.load("path/to/music_audio.wav", sr = 24000)
-wavs = torch.tensor(wav).unsqueeze(0).to(device)
-
-# This will automatically fetch the checkpoint from huggingface
-muq = MuQ.from_pretrained("OpenMuQ/MuQ-large-msd-iter")
-muq = muq.to(device).eval()
-
-with torch.no_grad():
-    output = muq(wavs, output_hidden_states=True)
-
-print('Total number of layers: ', len(output.hidden_states))
-print('Feature shape: ', output.last_hidden_state.shape)
-
-```
-
-## Model Checkpoints
-
-| Model Name | Parameters | Data | HuggingFace🤗 |
-| ----------- | --- | --- | ----------- |
-| MuQ | ~300M | MSD dataset | [OpenMuQ/MuQ-large-msd-iter](https://huggingface.co/OpenMuQ/MuQ-large-msd-iter) |
-| MuQ-MuLan | ~700M | music-text pairs | [OpenMuQ/MuQ-MuLan-large](https://huggingface.co/OpenMuQ/MuQ-MuLan-large) |
-
-**Note**: Please note that the open-sourced MuQ was trained on the Million Song Dataset. Due to differences in dataset size, the open-sourced model may not achieve the same level of performance as reported in the paper.
-
-## License
-
-The code is released under the MIT license.
-
-The model weights (MuQ-large-msd-iter, MuQ-MuLan-large) are released under the CC-BY-NC 4.0 license.
-
-## Citation
-
-```
-@article{zhu2025muq,
-title={MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization},
-author={Haina Zhu and Yizhi Zhou and Hangting Chen and Jianwei Yu and Ziyang Ma and Rongzhi Gu and Yi Luo and Wei Tan and Xie Chen},
-journal={arXiv preprint arXiv:2501.01108},
-year={2025}
-}
-```
TTS/DiffRhythm/MuQ-MuLan-large/config.json
DELETED
@@ -1,41 +0,0 @@
-{
-  "mulan": {
-    "sr": 24000,
-    "clip_secs": 10,
-    "dim_latent": 512,
-    "decoupled_contrastive_learning": true,
-    "hierarchical_contrastive_loss": false,
-    "hierarchical_contrastive_loss_layers": null,
-    "sigmoid_contrastive_loss": false,
-    "rank_contrast": true
-  },
-  "audio_model": {
-    "name": "OpenMuQ/MuQ-large-msd-iter",
-    "model_dim": 1024,
-    "use_layer_idx": -1
-  },
-  "text_model": {
-    "name": "xlm-roberta-base",
-    "model_dim": null,
-    "use_layer_idx": -1
-  },
-  "audio_transformer": {
-    "dim": 768,
-    "tf_depth": 0,
-    "heads": 8,
-    "dim_head": 64,
-    "attn_dropout": 0,
-    "ff_dropout": 0,
-    "ff_mult": 4
-  },
-  "text_transformer": {
-    "dim": 768,
-    "tf_depth": 8,
-    "max_seq_len": 1024,
-    "dim_head": 64,
-    "heads": 8,
-    "attn_dropout": 0,
-    "ff_dropout": 0,
-    "ff_mult": 4
-  }
-}
TTS/DiffRhythm/MuQ-MuLan-large/pytorch_model.bin
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:d42ae3f7cb9b66759ee0089ddc70e2f28b130c2d8ba621457358272d32dd0444
-size 2653954401
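The three deleted lines above are a Git LFS pointer file: the repository stores only this small text stub, and the `size` field records the byte count of the real binary. A minimal Python sketch of reading such a pointer follows. The helper name `parse_lfs_pointer` and the returned dictionary keys are hypothetical, and the sketch assumes every line has the form `key value`, as in the pointers shown in this diff.

```python
def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into its fields.

    Assumes each line is "<key> <value>", as in the pointers in this diff.
    """
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer contents copied from the deleted pytorch_model.bin above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:d42ae3f7cb9b66759ee0089ddc70e2f28b130c2d8ba621457358272d32dd0444
size 2653954401
"""

info = parse_lfs_pointer(pointer)
size_gib = info["size_bytes"] / 1024**3
print(f"{info['hash_algo']} object, {size_gib:.2f} GiB")
```

Summing `size_bytes` over all deleted pointers would give the storage actually freed, which the `+0 -3` line counts in the summary do not show.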
TTS/DiffRhythm/MuQ-large-msd-iter/README.md
DELETED
@@ -1,113 +0,0 @@
----
-license: cc-by-nc-4.0
-language:
-- en
-- zh
-pipeline_tag: audio-classification
-tags:
-- music
----
-
-# MuQ & MuQ-MuLan
-
-<div>
-<a href='#'><img alt="Static Badge" src="https://img.shields.io/badge/Python-3.8%2B-blue?logo=python&logoColor=white"></a>
-<a href='https://arxiv.org/abs/2501.01108'><img alt="Static Badge" src="https://img.shields.io/badge/arXiv-2501.01108-%23b31b1b?logo=arxiv&link=https%3A%2F%2Farxiv.org%2F"></a>
-<a href='https://huggingface.co/OpenMuQ'><img alt="Static Badge" src="https://img.shields.io/badge/huggingface-OpenMuQ-%23FFD21E?logo=huggingface&link=https%3A%2F%2Fhuggingface.co%2FOpenMuQ"></a>
-<a href='https://pytorch.org/'><img alt="Static Badge" src="https://img.shields.io/badge/framework-PyTorch-%23EE4C2C?logo=pytorch"></a>
-<a href='https://pypi.org/project/muq'><img alt="Static Badge" src="https://img.shields.io/badge/pip%20install-muq-green?logo=PyPI&logoColor=white&link=https%3A%2F%2Fpypi.org%2Fproject%2Fmuq"></a>
-</div>
-
-
-This is the official repository for the paper *"**MuQ**: Self-Supervised **Mu**sic Representation Learning
-with Mel Residual Vector **Q**uantization"*. For more detailed information, we strongly recommend referring to https://github.com/tencent-ailab/MuQ and the [paper]((https://arxiv.org/abs/2501.01108)).
-
-In this repo, the following models are released:
-
-- **MuQ**(see [this link](https://huggingface.co/OpenMuQ/MuQ-large-msd-iter)): A large music foundation model pre-trained via Self-Supervised Learning (SSL), achieving SOTA in various MIR tasks.
-- **MuQ-MuLan**(see [this link](https://huggingface.co/OpenMuQ/MuQ-MuLan-large)): A music-text joint embedding model trained via contrastive learning, supporting both English and Chinese texts.
-
-
-## Usage
-
-To begin with, please use pip to install the official `muq` lib, and ensure that your `python>=3.8`:
-```bash
-pip3 install muq
-```
-
-
-
-To extract music audio features using **MuQ**:
-```python
-import torch, librosa
-from muq import MuQ
-
-device = 'cuda'
-wav, sr = librosa.load("path/to/music_audio.wav", sr = 24000)
-wavs = torch.tensor(wav).unsqueeze(0).to(device)
-
-# This will automatically fetch the checkpoint from huggingface
-muq = MuQ.from_pretrained("OpenMuQ/MuQ-large-msd-iter")
-muq = muq.to(device).eval()
-
-with torch.no_grad():
-    output = muq(wavs, output_hidden_states=True)
-
-print('Total number of layers: ', len(output.hidden_states))
-print('Feature shape: ', output.last_hidden_state.shape)
-
-```
-
-
-
-Using **MuQ-MuLan** to extract the music and text embeddings and calculate the similarity:
-```python
-import torch, librosa
-from muq import MuQMuLan
-
-# This will automatically fetch checkpoints from huggingface
-device = 'cuda'
-mulan = MuQMuLan.from_pretrained("OpenMuQ/MuQ-MuLan-large")
-mulan = mulan.to(device).eval()
-
-# Extract music embeddings
-wav, sr = librosa.load("path/to/music_audio.wav", sr = 24000)
-wavs = torch.tensor(wav).unsqueeze(0).to(device)
-with torch.no_grad():
-    audio_embeds = mulan(wavs = wavs)
-
-# Extract text embeddings (texts can be in English or Chinese)
-texts = ["classical genres, hopeful mood, piano.", "一首适合海边风景的小提琴曲,节奏欢快"]
-with torch.no_grad():
-    text_embeds = mulan(texts = texts)
-
-# Calculate dot product similarity
-sim = mulan.calc_similarity(audio_embeds, text_embeds)
-print(sim)
-```
-
-## Model Checkpoints
-
-| Model Name | Parameters | Data | HuggingFace🤗 |
-| ----------- | --- | --- | ----------- |
-| MuQ | ~300M | MSD dataset | [OpenMuQ/MuQ-large-msd-iter](https://huggingface.co/OpenMuQ/MuQ-large-msd-iter) |
-| MuQ-MuLan | ~700M | music-text pairs | [OpenMuQ/MuQ-MuLan-large](https://huggingface.co/OpenMuQ/MuQ-MuLan-large) |
-
-**Note**: Please note that the open-sourced MuQ was trained on the Million Song Dataset. Due to differences in dataset size, the open-sourced model may not achieve the same level of performance as reported in the paper.
-
-## License
-
-The code is released under the MIT license.
-
-The model weights (MuQ-large-msd-iter, MuQ-MuLan-large) are released under the CC-BY-NC 4.0 license.
-
-## Citation
-
-```
-@article{zhu2025muq,
-title={MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization},
-author={Haina Zhu and Yizhi Zhou and Hangting Chen and Jianwei Yu and Ziyang Ma and Rongzhi Gu and Yi Luo and Wei Tan and Xie Chen},
-journal={arXiv preprint arXiv:2501.01108},
-year={2025}
-}
-```
TTS/DiffRhythm/MuQ-large-msd-iter/config.json
DELETED
@@ -1,143 +0,0 @@
-{
-  "codebook_dim": 16,
-  "codebook_size": 8192,
-  "conv_dim": 512,
-  "encoder_depth": 12,
-  "encoder_dim": 1024,
-  "features": [
-    "melspec_2048"
-  ],
-  "hop_length": 240,
-  "is_flash": false,
-  "label_rate": 25,
-  "mask_hop": 0.4,
-  "mask_prob": 0.6,
-  "n_mels": 128,
-  "num_codebooks": 1,
-  "recon_loss_ratio": null,
-  "resume_checkpoint": null,
-  "rvq_ckpt_path": null,
-  "rvq_multi_layer_num": 1,
-  "rvq_n_codebooks": 8,
-  "stat": {
-    "melspec_2048_cnt": 14282760192,
-    "melspec_2048_mean": 6.768444971712967,
-    "melspec_2048_std": 18.417922652295623
-  },
-  "use_encodec_target": false,
-  "use_rvq_target": true,
-  "use_vq_target": false,
-  "w2v2_config": {
-    "activation_dropout": 0.1,
-    "adapter_kernel_size": 3,
-    "adapter_stride": 2,
-    "add_adapter": false,
-    "apply_spec_augment": true,
-    "architectures": [
-      "Wav2Vec2ConformerForCTC"
-    ],
-    "attention_dropout": 0.1,
-    "bos_token_id": 1,
-    "classifier_proj_size": 256,
-    "codevector_dim": 768,
-    "conformer_conv_dropout": 0.1,
-    "contrastive_logits_temperature": 0.1,
-    "conv_bias": true,
-    "conv_depthwise_kernel_size": 31,
-    "conv_dim": [
-      512,
-      512,
-      512,
-      512,
-      512,
-      512,
-      512
-    ],
-    "conv_kernel": [
-      10,
-      3,
-      3,
-      3,
-      3,
-      2,
-      2
-    ],
-    "conv_stride": [
-      5,
-      2,
-      2,
-      2,
-      2,
-      2,
-      2
-    ],
-    "ctc_loss_reduction": "sum",
-    "ctc_zero_infinity": false,
-    "diversity_loss_weight": 0.1,
-    "do_stable_layer_norm": true,
-    "eos_token_id": 2,
-    "feat_extract_activation": "gelu",
-    "feat_extract_dropout": 0.0,
-    "feat_extract_norm": "layer",
-    "feat_proj_dropout": 0.1,
-    "feat_quantizer_dropout": 0.0,
-    "final_dropout": 0.1,
-    "gradient_checkpointing": false,
-    "hidden_act": "swish",
-    "hidden_dropout": 0.1,
-    "hidden_dropout_prob": 0.1,
-    "hidden_size": 1024,
-    "initializer_range": 0.02,
-    "intermediate_size": 4096,
-    "layer_norm_eps": 1e-05,
-    "layerdrop": 0.0,
-    "mask_feature_length": 10,
-    "mask_feature_min_masks": 0,
-    "mask_feature_prob": 0.0,
-    "mask_time_length": 10,
-    "mask_time_min_masks": 2,
-    "mask_time_prob": 0.05,
-    "max_source_positions": 5000,
-    "model_type": "wav2vec2-conformer",
-    "num_adapter_layers": 3,
-    "num_attention_heads": 16,
-    "num_codevector_groups": 2,
-    "num_codevectors_per_group": 320,
-    "num_conv_pos_embedding_groups": 16,
-    "num_conv_pos_embeddings": 128,
-    "num_feat_extract_layers": 7,
-    "num_hidden_layers": 24,
-    "num_negatives": 100,
-    "output_hidden_size": 1024,
-    "pad_token_id": 0,
-    "position_embeddings_type": "rotary",
-    "proj_codevector_dim": 768,
-    "rotary_embedding_base": 10000,
-    "tdnn_dilation": [
-      1,
-      2,
-      3,
-      1,
-      1
-    ],
-    "tdnn_dim": [
-      512,
-      512,
-      512,
-      512,
-      1500
-    ],
-    "tdnn_kernel": [
-      5,
-      3,
-      3,
-      1,
-      1
-    ],
-    "torch_dtype": "float32",
-    "transformers_version": "4.19.0.dev0",
-    "use_weighted_layer_sum": false,
-    "vocab_size": 32,
-    "xvector_output_dim": 512
-  }
-}
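The `hop_length` and `label_rate` values in the deleted config above imply a frame-rate relationship that is easy to check by hand. The Python sketch below works it through under stated assumptions: that `hop_length` is measured in samples at the 24000 Hz rate given as `sr` in the MuQ-MuLan config, and that `label_rate` counts target labels per second. Neither interpretation is confirmed by this diff, and the function names are hypothetical.

```python
def mel_frames_per_second(sample_rate_hz, hop_length):
    """Mel-spectrogram frames produced per second of audio."""
    return sample_rate_hz / hop_length


def mel_frames_per_label(sample_rate_hz, hop_length, label_rate_hz):
    """How many mel frames map onto one target label."""
    return mel_frames_per_second(sample_rate_hz, hop_length) / label_rate_hz


# Values copied from the deleted configs; their interpretation is an assumption.
SAMPLE_RATE_HZ = 24000  # "sr" in MuQ-MuLan-large/config.json
HOP_LENGTH = 240        # "hop_length" in MuQ-large-msd-iter/config.json
LABEL_RATE_HZ = 25      # "label_rate" in MuQ-large-msd-iter/config.json
CLIP_SECS = 10          # "clip_secs" in MuQ-MuLan-large/config.json

frames_per_sec = mel_frames_per_second(SAMPLE_RATE_HZ, HOP_LENGTH)
frames_per_label = mel_frames_per_label(SAMPLE_RATE_HZ, HOP_LENGTH, LABEL_RATE_HZ)
labels_per_clip = LABEL_RATE_HZ * CLIP_SECS

print(frames_per_sec, frames_per_label, labels_per_clip)
```

Under these assumptions a 10-second clip yields 1000 mel frames and 250 labels, a 4x temporal reduction between the two.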
TTS/DiffRhythm/MuQ-large-msd-iter/model.safetensors
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:273febab2be02872c37d2c37e48a9d6c52c1c9392f3eeeabd498efa281ccb7a6
-size 1333825096
TTS/DiffRhythm/MuQ-large-msd-iter/pytorch_model.bin
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:334df3de2832ec1acfd8b6ce54e7de4073401fe821f7ec0ad0d954832be2d26a
-size 1333965438
TTS/DiffRhythm/cfm_model_v1_2.pt
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:3e819b317ce2cf1fb22f386d74f351b697204ec1f57f03edfe50dbca71cf0768
-size 2218709125
TTS/DiffRhythm/config.json
DELETED
@@ -1,13 +0,0 @@
-{
-  "model_type": "diffrhythm",
-  "model": {
-    "dim": 2048,
-    "depth": 16,
-    "heads": 32,
-    "ff_mult": 4,
-    "text_dim": 512,
-    "conv_layers": 4,
-    "mel_dim": 64,
-    "text_num_embeds": 363
-  }
-}
TTS/DiffRhythm/vae_model.pt
DELETED
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:712693f27299937c6ccf1a6d6f1d9b45c7c8c11210d3b0cbb0f36181465ba29f
-size 624520127
TTS/DiffRhythm/xlm-roberta-base/README.md
DELETED
@@ -1,200 +0,0 @@
----
-tags:
-- exbert
-language:
-- multilingual
-- af
-- am
-- ar
-- as
-- az
-- be
-- bg
-- bn
-- br
-- bs
-- ca
-- cs
-- cy
-- da
-- de
-- el
-- en
-- eo
-- es
-- et
-- eu
-- fa
-- fi
-- fr
-- fy
-- ga
-- gd
-- gl
-- gu
-- ha
-- he
-- hi
-- hr
-- hu
-- hy
-- id
-- is
-- it
-- ja
-- jv
-- ka
-- kk
-- km
-- kn
-- ko
-- ku
-- ky
-- la
-- lo
-- lt
-- lv
-- mg
-- mk
-- ml
-- mn
-- mr
-- ms
-- my
-- ne
-- nl
-- no
-- om
-- or
-- pa
-- pl
-- ps
-- pt
-- ro
-- ru
-- sa
-- sd
-- si
-- sk
-- sl
-- so
-- sq
-- sr
-- su
-- sv
-- sw
-- ta
-- te
-- th
-- tl
-- tr
-- ug
-- uk
-- ur
-- uz
-- vi
-- xh
-- yi
-- zh
-license: mit
----
-
-# XLM-RoBERTa (base-sized model)
-
-XLM-RoBERTa model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. It was introduced in the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Conneau et al. and first released in [this repository](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).
-
-Disclaimer: The team releasing XLM-RoBERTa did not write a model card for this model so this model card has been written by the Hugging Face team.
-
-## Model description
-
-XLM-RoBERTa is a multilingual version of RoBERTa. It is pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.
-
-RoBERTa is a transformers model pretrained on a large corpus in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts.
-
-More precisely, it was pretrained with the Masked language modeling (MLM) objective. Taking a sentence, the model randomly masks 15% of the words in the input then run the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
-
-This way, the model learns an inner representation of 100 languages that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the XLM-RoBERTa model as inputs.
-
-## Intended uses & limitations
-
-You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?search=xlm-roberta) to look for fine-tuned versions on a task that interests you.
-
-Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at models like GPT2.
-
-## Usage
-
-You can use this model directly with a pipeline for masked language modeling:
-
-```python
->>> from transformers import pipeline
->>> unmasker = pipeline('fill-mask', model='xlm-roberta-base')
->>> unmasker("Hello I'm a <mask> model.")
-
-[{'score': 0.10563907772302628,
-  'sequence': "Hello I'm a fashion model.",
-  'token': 54543,
-  'token_str': 'fashion'},
- {'score': 0.08015287667512894,
-  'sequence': "Hello I'm a new model.",
-  'token': 3525,
-  'token_str': 'new'},
- {'score': 0.033413201570510864,
-  'sequence': "Hello I'm a model model.",
-  'token': 3299,
-  'token_str': 'model'},
- {'score': 0.030217764899134636,
-  'sequence': "Hello I'm a French model.",
-  'token': 92265,
-  'token_str': 'French'},
- {'score': 0.026436051353812218,
-  'sequence': "Hello I'm a sexy model.",
-  'token': 17473,
-  'token_str': 'sexy'}]
-```
-
-Here is how to use this model to get the features of a given text in PyTorch:
-
-```python
-from transformers import AutoTokenizer, AutoModelForMaskedLM
-
-tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
-model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
-
-# prepare input
-text = "Replace me by any text you'd like."
-encoded_input = tokenizer(text, return_tensors='pt')
-
-# forward pass
-output = model(**encoded_input)
-```
-
-### BibTeX entry and citation info
-
-```bibtex
-@article{DBLP:journals/corr/abs-1911-02116,
-author = {Alexis Conneau and
-Kartikay Khandelwal and
-Naman Goyal and
-Vishrav Chaudhary and
-Guillaume Wenzek and
-Francisco Guzm{\'{a}}n and
-Edouard Grave and
-Myle Ott and
-Luke Zettlemoyer and
-Veselin Stoyanov},
-title = {Unsupervised Cross-lingual Representation Learning at Scale},
-journal = {CoRR},
-volume = {abs/1911.02116},
-year = {2019},
-url = {http://arxiv.org/abs/1911.02116},
-eprinttype = {arXiv},
-eprint = {1911.02116},
-timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
-biburl = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
-bibsource = {dblp computer science bibliography, https://dblp.org}
-}
-```
-
-<a href="https://huggingface.co/exbert/?model=xlm-roberta-base">
-<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
-</a>
TTS/DiffRhythm/xlm-roberta-base/config.json
DELETED
|
@@ -1,25 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"architectures": [
|
| 3 |
-
"XLMRobertaForMaskedLM"
|
| 4 |
-
],
|
| 5 |
-
"attention_probs_dropout_prob": 0.1,
|
| 6 |
-
"bos_token_id": 0,
|
| 7 |
-
"eos_token_id": 2,
|
| 8 |
-
"hidden_act": "gelu",
|
| 9 |
-
"hidden_dropout_prob": 0.1,
|
| 10 |
-
"hidden_size": 768,
|
| 11 |
-
"initializer_range": 0.02,
|
| 12 |
-
"intermediate_size": 3072,
|
| 13 |
-
"layer_norm_eps": 1e-05,
|
| 14 |
-
"max_position_embeddings": 514,
|
| 15 |
-
"model_type": "xlm-roberta",
|
| 16 |
-
"num_attention_heads": 12,
|
| 17 |
-
"num_hidden_layers": 12,
|
| 18 |
-
"output_past": true,
|
| 19 |
-
"pad_token_id": 1,
|
| 20 |
-
"position_embedding_type": "absolute",
|
| 21 |
-
"transformers_version": "4.17.0.dev0",
|
| 22 |
-
"type_vocab_size": 1,
|
| 23 |
-
"use_cache": true,
|
| 24 |
-
"vocab_size": 250002
|
| 25 |
-
}
|
TTS/DiffRhythm/xlm-roberta-base/flax_model.msgpack
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:311b6941e02128b01c6a429f55b47b351a86fe53e6802774d87696bcbc465992
|
| 3 |
-
size 1113187999
|
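Each three-line deletion of a large binary in this commit (`version`, `oid`, `size`) is a Git LFS pointer, not the weights themselves. As a minimal sketch, assuming the pointer text is available as a string, the fields can be read as below; `parse_lfs_pointer` is a hypothetical helper written for illustration and is not part of Git LFS or this repository.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is a key, one space, then the value.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer text copied from the flax_model.msgpack deletion above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:311b6941e02128b01c6a429f55b47b351a86fe53e6802774d87696bcbc465992
size 1113187999
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size_bytes"])
```

The `size` field is the byte count of the real object, so summing it across pointers gives an estimate of how much storage a cleanup like this one frees.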
TTS/DiffRhythm/xlm-roberta-base/model.onnx
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:a76bfe6a405f1a9ace42b2dbd8fbd284dd8127a732ddcf2145b0fc9413b30d40
|
| 3 |
-
size 1881470773
|
TTS/DiffRhythm/xlm-roberta-base/model.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:6fd4797bc397c3b8b55d6bb5740366b57e6a3ce91c04c77f22aafc0c128e6feb
|
| 3 |
-
size 1115567652
|
TTS/DiffRhythm/xlm-roberta-base/pytorch_model.bin
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:9d83baaafea92d36de26002c8135a427d55ee6fdc4faaa6e400be4c47724a07e
|
| 3 |
-
size 1115590446
|
TTS/DiffRhythm/xlm-roberta-base/sentencepiece.bpe.model
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
|
| 3 |
-
size 5069051
|
TTS/DiffRhythm/xlm-roberta-base/tf_model.h5
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:d1232fb4018ab3a236c29f10aefd190ef844ad994ac74820d9532637bd87b3f4
|
| 3 |
-
size 1112441536
|
TTS/DiffRhythm/xlm-roberta-base/tokenizer.json
DELETED
|
The diff for this file is too large to render.
See raw diff
|
TTS/DiffRhythm/xlm-roberta-base/tokenizer_config.json
DELETED
|
@@ -1 +0,0 @@
|
|
| 1 |
-
{"model_max_length": 512}
|
ace_step/README.md
DELETED
|
@@ -1,122 +0,0 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: apache-2.0
|
| 3 |
-
tags:
|
| 4 |
-
- music
|
| 5 |
-
- text2music
|
| 6 |
-
- acestep
|
| 7 |
-
pipeline_tag: text-to-audio
|
| 8 |
-
language:
|
| 9 |
-
- en
|
| 10 |
-
- zh
|
| 11 |
-
- de
|
| 12 |
-
- fr
|
| 13 |
-
- es
|
| 14 |
-
- it
|
| 15 |
-
- pt
|
| 16 |
-
- pl
|
| 17 |
-
- tr
|
| 18 |
-
- ru
|
| 19 |
-
- cs
|
| 20 |
-
- nl
|
| 21 |
-
- ar
|
| 22 |
-
- ja
|
| 23 |
-
- hu
|
| 24 |
-
- ko
|
| 25 |
-
- hi
|
| 26 |
-
---
|
| 27 |
-
|
| 28 |
-
# ACE-Step: A Step Towards Music Generation Foundation Model
|
| 29 |
-
|
| 30 |
-

|
| 31 |
-
|
| 32 |
-
## Model Description
|
| 33 |
-
|
| 34 |
-
ACE-Step is a novel open-source foundation model for music generation that overcomes key limitations of existing approaches through a holistic architectural design. It integrates diffusion-based generation with Sana's Deep Compression AutoEncoder (DCAE) and a lightweight linear transformer, achieving state-of-the-art performance in generation speed, musical coherence, and controllability.
|
| 35 |
-
|
| 36 |
-
**Key Features:**
|
| 37 |
-
- 15× faster than LLM-based baselines (20s for 4-minute music on A100)
|
| 38 |
-
- Superior musical coherence across melody, harmony, and rhythm
|
| 39 |
-
- full-song generation, duration control and accepts natural language descriptions
|
| 40 |
-
|
| 41 |
-
## Uses
|
| 42 |
-
|
| 43 |
-
### Direct Use
|
| 44 |
-
ACE-Step can be used for:
|
| 45 |
-
- Generating original music from text descriptions
|
| 46 |
-
- Music remixing and style transfer
|
| 47 |
-
- edit song lyrics
|
| 48 |
-
|
| 49 |
-
### Downstream Use
|
| 50 |
-
The model serves as a foundation for:
|
| 51 |
-
- Voice cloning applications
|
| 52 |
-
- Specialized music generation (rap, jazz, etc.)
|
| 53 |
-
- Music production tools
|
| 54 |
-
- Creative AI assistants
|
| 55 |
-
|
| 56 |
-
### Out-of-Scope Use
|
| 57 |
-
The model should not be used for:
|
| 58 |
-
- Generating copyrighted content without permission
|
| 59 |
-
- Creating harmful or offensive content
|
| 60 |
-
- Misrepresenting AI-generated music as human-created
|
| 61 |
-
|
| 62 |
-
## How to Get Started
|
| 63 |
-
|
| 64 |
-
see: https://github.com/ace-step/ACE-Step
|
| 65 |
-
|
| 66 |
-
## Hardware Performance
|
| 67 |
-
|
| 68 |
-
| Device | 27 Steps | 60 Steps |
|
| 69 |
-
|---------------|----------|----------|
|
| 70 |
-
| NVIDIA A100 | 27.27x | 12.27x |
|
| 71 |
-
| RTX 4090 | 34.48x | 15.63x |
|
| 72 |
-
| RTX 3090 | 12.76x | 6.48x |
|
| 73 |
-
| M2 Max | 2.27x | 1.03x |
|
| 74 |
-
|
| 75 |
-
*RTF (Real-Time Factor) shown - higher values indicate faster generation*
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
## Limitations
|
| 79 |
-
|
| 80 |
-
- Performance varies by language (top 10 languages perform best)
|
| 81 |
-
- Longer generations (>5 minutes) may lose structural coherence
|
| 82 |
-
- Rare instruments may not render perfectly
|
| 83 |
-
- Output Inconsistency: Highly sensitive to random seeds and input duration, leading to varied "gacha-style" results.
|
| 84 |
-
- Style-specific Weaknesses: Underperforms on certain genres (e.g. Chinese rap/zh_rap) Limited style adherence and musicality ceiling
|
| 85 |
-
- Continuity Artifacts: Unnatural transitions in repainting/extend operations
|
| 86 |
-
- Vocal Quality: Coarse vocal synthesis lacking nuance
|
| 87 |
-
- Control Granularity: Needs finer-grained musical parameter control
|
| 88 |
-
|
| 89 |
-
## Ethical Considerations
|
| 90 |
-
|
| 91 |
-
Users should:
|
| 92 |
-
- Verify originality of generated works
|
| 93 |
-
- Disclose AI involvement
|
| 94 |
-
- Respect cultural elements and copyrights
|
| 95 |
-
- Avoid harmful content generation
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
## Model Details
|
| 99 |
-
|
| 100 |
-
**Developed by:** ACE Studio and StepFun
|
| 101 |
-
**Model type:** Diffusion-based music generation with transformer conditioning
|
| 102 |
-
**License:** Apache 2.0
|
| 103 |
-
**Resources:**
|
| 104 |
-
- [Project Page](https://ace-step.github.io/)
|
| 105 |
-
- [Demo Space](https://huggingface.co/spaces/ACE-Step/ACE-Step)
|
| 106 |
-
- [GitHub Repository](https://github.com/ACE-Step/ACE-Step)
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
## Citation
|
| 110 |
-
|
| 111 |
-
```bibtex
|
| 112 |
-
@misc{gong2025acestep,
|
| 113 |
-
title={ACE-Step: A Step Towards Music Generation Foundation Model},
|
| 114 |
-
author={Junmin Gong, Wenxiao Zhao, Sen Wang, Shengyuan Xu, Jing Guo},
|
| 115 |
-
howpublished={\url{https://github.com/ace-step/ACE-Step}},
|
| 116 |
-
year={2025},
|
| 117 |
-
note={GitHub repository}
|
| 118 |
-
}
|
| 119 |
-
```
|
| 120 |
-
|
| 121 |
-
## Acknowledgements
|
| 122 |
-
This project is co-led by ACE Studio and StepFun.
|
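The Hardware Performance table in the deleted `ace_step/README.md` reports RTF where higher is faster. A short sketch of how those figures translate into wall-clock time, assuming RTF here means audio duration divided by generation time (the README does not spell out the formula); `generation_seconds` is a hypothetical helper named for illustration only.

```python
def generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time, assuming RTF = audio duration / generation time."""
    return audio_seconds / rtf


# (27-step RTF, 60-step RTF) per device, copied from the README table.
RTF_TABLE = {
    "NVIDIA A100": (27.27, 12.27),
    "RTX 4090": (34.48, 15.63),
    "RTX 3090": (12.76, 6.48),
    "M2 Max": (2.27, 1.03),
}

FOUR_MINUTES = 4 * 60  # seconds of generated audio

for device, (rtf_27, rtf_60) in RTF_TABLE.items():
    t27 = generation_seconds(FOUR_MINUTES, rtf_27)
    t60 = generation_seconds(FOUR_MINUTES, rtf_60)
    print(f"{device}: {t27:.1f}s at 27 steps, {t60:.1f}s at 60 steps")
```

Under this reading, the A100 figure at 60 steps works out to roughly 20 seconds for a 4-minute track, which lines up with the "20s for 4-minute music on A100" claim in the README's Key Features list.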
ace_step/config.json
DELETED
|
@@ -1,35 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"_class_name": "ACEStepTransformer2DModel",
|
| 3 |
-
"_diffusers_version": "0.32.2",
|
| 4 |
-
"attention_head_dim": 128,
|
| 5 |
-
"in_channels": 8,
|
| 6 |
-
"inner_dim": 2560,
|
| 7 |
-
"lyric_encoder_vocab_size": 6693,
|
| 8 |
-
"lyric_hidden_size": 1024,
|
| 9 |
-
"max_height": 16,
|
| 10 |
-
"max_position": 32768,
|
| 11 |
-
"max_width": 32768,
|
| 12 |
-
"mlp_ratio": 2.5,
|
| 13 |
-
"num_attention_heads": 20,
|
| 14 |
-
"num_layers": 24,
|
| 15 |
-
"out_channels": 8,
|
| 16 |
-
"patch_size": [
|
| 17 |
-
16,
|
| 18 |
-
1
|
| 19 |
-
],
|
| 20 |
-
"rope_theta": 1000000.0,
|
| 21 |
-
"speaker_embedding_dim": 512,
|
| 22 |
-
"ssl_encoder_depths": [
|
| 23 |
-
8,
|
| 24 |
-
8
|
| 25 |
-
],
|
| 26 |
-
"ssl_latent_dims": [
|
| 27 |
-
1024,
|
| 28 |
-
768
|
| 29 |
-
],
|
| 30 |
-
"ssl_names": [
|
| 31 |
-
"mert",
|
| 32 |
-
"m-hubert"
|
| 33 |
-
],
|
| 34 |
-
"text_embedding_dim": 768
|
| 35 |
-
}
|
audio/MelBandRoformer_fp16.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:6119aef379a6c7264e0b37db65ae1e6488b8ca4a00baf56d6d244737b8488226
|
| 3 |
-
size 456479072
|
diffusion_models/Phantom-Wan-14B_fp8_e4m3fn.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:205c2924aadcd4e1312d6aac0b4cfba80eeea33db99419b113c10eec4810cabc
|
| 3 |
-
size 15001320640
|
diffusion_models/Wan2_1-I2V-14B-480p_fp8_e4m3fn_scaled_KJ.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:2ff922282cd84589702e6e8c26e083d1160bfc2b217dd44e1ae2688441dc495d
|
| 3 |
-
size 16643349018
|
diffusion_models/Wan2_1-InfiniteTalk-Multi_fp8_e4m3fn_scaled_KJ.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:4ded4f02f2bf312e7a68f2d75cd0c680a177aef6917c9960a1eddc34f70de26d
|
| 3 |
-
size 2712729090
|
diffusion_models/Wan2_1-InfiniteTalk-Single_fp8_e4m3fn_scaled_KJ.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:bd6e0e6feab8c22a482b1c4dd7c0504c215c35b507ddc3b4dcaa5d3ef539879e
|
| 3 |
-
size 2713548210
|
diffusion_models/Wan2_1-InfiniteTalk_Multi_Q8.gguf
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:2b9b1dc2fb0f0a351e688ad8dc7545bf90b2a2f20cd91953ac077510ef6b7bc0
|
| 3 |
-
size 2646330016
|
diffusion_models/Wan2_1-InfiniteTalk_Single_Q8.gguf
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:c5e251c56174995d940494ec02fdf9d36da00dffdde6827829801cd171fe8ffd
|
| 3 |
-
size 2646330016
|
diffusion_models/wan2.1-i2v-14b-480p-Q4_K_M.gguf
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:d91f7139acadb42ea05cdf97b311e5099f714f11fbe4d90916500e2f53cbba82
|
| 3 |
-
size 11341184384
|
loras/FastWan_T2V_14B_480p_lora_rank_128_bf16.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:93fe4efb5198710843de9843091e15a4a967702f62f169135b73be51884fb7d7
|
| 3 |
-
size 1253192432
|
loras/Wan2.2-Fun-A14B-InP-LOW-HPS2.1_resized_dynamic_avg_rank_15_bf16.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:1879ffd9ee08b533157eb04b6440673515be1ac7b4ee81648355e3bf3a59bdfd
|
| 3 |
-
size 101752852
|
loras/Wan21_PusaV1_LoRA_14B_rank512_bf16.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:a510b5562e05efa831127bd6a6b3aecf1c4747cffdddcc0b28f88c0667ef1694
|
| 3 |
-
size 4907437824
|
misc/TTS/ACE-Step-v1-3.5B/ace_step_transformer/diffusion_pytorch_model.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:e810f16728d8a2e0d1b9c3a907aac8c9a427ce38edbd890cb3dce5ff92da5aad
|
| 3 |
-
size 6611422728
|
misc/TTS/ACE-Step-v1-3.5B/music_dcae_f8c8/diffusion_pytorch_model.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:2b0cb469307ac50659d1880db2a99bae47d0df335cbb36853964662d4b80e8ee
|
| 3 |
-
size 313646516
|
misc/TTS/ACE-Step-v1-3.5B/music_vocoder/diffusion_pytorch_model.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:c92c9b46e28ab7b37b777780cf4308ad7ddac869636bb77aa61599358c4bc1c0
|
| 3 |
-
size 206350988
|
misc/TTS/ACE-Step-v1-3.5B/umt5-base/model.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:779cec0d210b2123e21d0a9cd8128f02b4d412627355028965a8be0b241cc3b6
|
| 3 |
-
size 1127460248
|
misc/ace_step/all_in_one/ace_step_v1_3.5b.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:f07cad74c4adce52ca14ca1bdf74cf3c14cbafb0823b95eca4459467fa369f40
|
| 3 |
-
size 7699743341
|
misc/clip_vision/clip_vision_h.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:64a7ef761bfccbadbaa3da77366aac4185a6c58fa5de5f589b42a65bcc21f161
|
| 3 |
-
size 1264219396
|
misc/diffusion_models/MelBandRoformer_fp16.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:6119aef379a6c7264e0b37db65ae1e6488b8ca4a00baf56d6d244737b8488226
|
| 3 |
-
size 456479072
|
misc/diffusion_models/Wan14BI2VFusioniX_phantom_14B_fp16.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:205c2924aadcd4e1312d6aac0b4cfba80eeea33db99419b113c10eec4810cabc
|
| 3 |
-
size 15001320640
|
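Several pointers deleted in this commit carry the same `oid` under different paths, which is what makes them duplicates regardless of filename. A minimal sketch of detecting that by grouping on the SHA-256 digest; the paths, digests, and sizes are copied from the deletions above, while `find_duplicates` and the tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

# (path, sha256 digest, size in bytes) taken from LFS pointers in this diff.
POINTERS = [
    ("audio/MelBandRoformer_fp16.safetensors",
     "6119aef379a6c7264e0b37db65ae1e6488b8ca4a00baf56d6d244737b8488226", 456479072),
    ("misc/diffusion_models/MelBandRoformer_fp16.safetensors",
     "6119aef379a6c7264e0b37db65ae1e6488b8ca4a00baf56d6d244737b8488226", 456479072),
    ("diffusion_models/Phantom-Wan-14B_fp8_e4m3fn.safetensors",
     "205c2924aadcd4e1312d6aac0b4cfba80eeea33db99419b113c10eec4810cabc", 15001320640),
    ("misc/diffusion_models/Wan14BI2VFusioniX_phantom_14B_fp16.safetensors",
     "205c2924aadcd4e1312d6aac0b4cfba80eeea33db99419b113c10eec4810cabc", 15001320640),
    ("loras/Wan21_PusaV1_LoRA_14B_rank512_bf16.safetensors",
     "a510b5562e05efa831127bd6a6b3aecf1c4747cffdddcc0b28f88c0667ef1694", 4907437824),
]


def find_duplicates(pointers):
    """Group paths by digest and keep only digests seen under more than one path."""
    by_digest = defaultdict(list)
    for path, digest, size in pointers:
        by_digest[digest].append((path, size))
    return {d: entries for d, entries in by_digest.items() if len(entries) > 1}


dupes = find_duplicates(POINTERS)

# Bytes occupied by the extra copies (every copy beyond the first per digest).
redundant_bytes = sum(size for entries in dupes.values() for _, size in entries[1:])

for digest, entries in dupes.items():
    print(digest[:12], [path for path, _ in entries])
print(f"redundant: {redundant_bytes / 1e9:.2f} GB")
```

Note that the second pair shares content even though one filename says `fp8` and the other `fp16`, so matching on the digest rather than the name is the safer test.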
misc/diffusion_models/Wan2_1-Fun-V1_1-14B-Control-Camera_fp8_e4m3fn.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:44fb0cd28b22e5f3fe71ec9604e1e03c83cb6b15cf0353a7f2b77bc316fafcc7
|
| 3 |
-
size 17648319713
|
misc/diffusion_models/Wan2_1-InfiniteTalk_Multi_Q8.gguf
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:2b9b1dc2fb0f0a351e688ad8dc7545bf90b2a2f20cd91953ac077510ef6b7bc0
|
| 3 |
-
size 2646330016
|
misc/diffusion_models/Wan2_1-InfiniteTalk_Single_Q8.gguf
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:c5e251c56174995d940494ec02fdf9d36da00dffdde6827829801cd171fe8ffd
|
| 3 |
-
size 2646330016
|
misc/diffusion_models/Wan2_1-T2V-14B_fp8_e4m3fn_scaled_KJ.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:5519e566e620037b1adb399886143991036d27d44455f41190410967a2fc130d
|
| 3 |
-
size 14526876890
|
misc/diffusion_models/Wan2_2-I2V-A14B-HIGH_fp8_e4m3fn_scaled_KJ.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:b3a6e732feb5fd5fa35f5e3ef612fa1f0a77dc66601fbf999d4f84a01e7120a6
|
| 3 |
-
size 15002999858
|
misc/diffusion_models/Wan2_2-I2V-A14B-LOW_fp8_e4m3fn_scaled_KJ.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:3338c9e672ad9e406a28b38231d6c9d94bf63ab73c3940b91428321993491bb8
|
| 3 |
-
size 15002999858
|
misc/diffusion_models/wan2.1-i2v-14b-480p-Q4_K_M.gguf
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:d91f7139acadb42ea05cdf97b311e5099f714f11fbe4d90916500e2f53cbba82
|
| 3 |
-
size 11341184384
|
misc/diffusion_models/wan2.2_fun_camera_high_noise_14B_fp8_scaled.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:c14fec6b1f1ee16acf7c6ae2feab8c2b0e909cfad15f6765d959c6dea587e0b4
|
| 3 |
-
size 15535183490
|
misc/diffusion_models/wan2.2_fun_camera_low_noise_14B_fp8_scaled.safetensors
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:6251dee756a4b9b26862e63491706aa68cad55999efc8299c102b54785b5f944
|
| 3 |
-
size 15535183490
|