Aloukik21 committed on Feb 14

Commit

a40a75c

verified ·

1 Parent(s): 3924df4

Cleanup: remove 72 unneeded files (255GB) - duplicates, old models, DiffRhythm, Infinity

This view is limited to 50 files because it contains too many changes. See raw diff

Files changed (50)

TTS/DiffRhythm/MuQ-MuLan-large/README.md +0 -111
TTS/DiffRhythm/MuQ-MuLan-large/config.json +0 -41
TTS/DiffRhythm/MuQ-MuLan-large/pytorch_model.bin +0 -3
TTS/DiffRhythm/MuQ-large-msd-iter/README.md +0 -113
TTS/DiffRhythm/MuQ-large-msd-iter/config.json +0 -143
TTS/DiffRhythm/MuQ-large-msd-iter/model.safetensors +0 -3
TTS/DiffRhythm/MuQ-large-msd-iter/pytorch_model.bin +0 -3
TTS/DiffRhythm/cfm_model_v1_2.pt +0 -3
TTS/DiffRhythm/config.json +0 -13
TTS/DiffRhythm/vae_model.pt +0 -3
TTS/DiffRhythm/xlm-roberta-base/README.md +0 -200
TTS/DiffRhythm/xlm-roberta-base/config.json +0 -25
TTS/DiffRhythm/xlm-roberta-base/flax_model.msgpack +0 -3
TTS/DiffRhythm/xlm-roberta-base/model.onnx +0 -3
TTS/DiffRhythm/xlm-roberta-base/model.safetensors +0 -3
TTS/DiffRhythm/xlm-roberta-base/pytorch_model.bin +0 -3
TTS/DiffRhythm/xlm-roberta-base/sentencepiece.bpe.model +0 -3
TTS/DiffRhythm/xlm-roberta-base/tf_model.h5 +0 -3
TTS/DiffRhythm/xlm-roberta-base/tokenizer.json +0 -0
TTS/DiffRhythm/xlm-roberta-base/tokenizer_config.json +0 -1
ace_step/README.md +0 -122
ace_step/config.json +0 -35
audio/MelBandRoformer_fp16.safetensors +0 -3
diffusion_models/Phantom-Wan-14B_fp8_e4m3fn.safetensors +0 -3
diffusion_models/Wan2_1-I2V-14B-480p_fp8_e4m3fn_scaled_KJ.safetensors +0 -3
diffusion_models/Wan2_1-InfiniteTalk-Multi_fp8_e4m3fn_scaled_KJ.safetensors +0 -3
diffusion_models/Wan2_1-InfiniteTalk-Single_fp8_e4m3fn_scaled_KJ.safetensors +0 -3
diffusion_models/Wan2_1-InfiniteTalk_Multi_Q8.gguf +0 -3
diffusion_models/Wan2_1-InfiniteTalk_Single_Q8.gguf +0 -3
diffusion_models/wan2.1-i2v-14b-480p-Q4_K_M.gguf +0 -3
loras/FastWan_T2V_14B_480p_lora_rank_128_bf16.safetensors +0 -3
loras/Wan2.2-Fun-A14B-InP-LOW-HPS2.1_resized_dynamic_avg_rank_15_bf16.safetensors +0 -3
loras/Wan21_PusaV1_LoRA_14B_rank512_bf16.safetensors +0 -3
misc/TTS/ACE-Step-v1-3.5B/ace_step_transformer/diffusion_pytorch_model.safetensors +0 -3
misc/TTS/ACE-Step-v1-3.5B/music_dcae_f8c8/diffusion_pytorch_model.safetensors +0 -3
misc/TTS/ACE-Step-v1-3.5B/music_vocoder/diffusion_pytorch_model.safetensors +0 -3
misc/TTS/ACE-Step-v1-3.5B/umt5-base/model.safetensors +0 -3
misc/ace_step/all_in_one/ace_step_v1_3.5b.safetensors +0 -3
misc/clip_vision/clip_vision_h.safetensors +0 -3
misc/diffusion_models/MelBandRoformer_fp16.safetensors +0 -3
misc/diffusion_models/Wan14BI2VFusioniX_phantom_14B_fp16.safetensors +0 -3
misc/diffusion_models/Wan2_1-Fun-V1_1-14B-Control-Camera_fp8_e4m3fn.safetensors +0 -3
misc/diffusion_models/Wan2_1-InfiniteTalk_Multi_Q8.gguf +0 -3
misc/diffusion_models/Wan2_1-InfiniteTalk_Single_Q8.gguf +0 -3
misc/diffusion_models/Wan2_1-T2V-14B_fp8_e4m3fn_scaled_KJ.safetensors +0 -3
misc/diffusion_models/Wan2_2-I2V-A14B-HIGH_fp8_e4m3fn_scaled_KJ.safetensors +0 -3
misc/diffusion_models/Wan2_2-I2V-A14B-LOW_fp8_e4m3fn_scaled_KJ.safetensors +0 -3
misc/diffusion_models/wan2.1-i2v-14b-480p-Q4_K_M.gguf +0 -3
misc/diffusion_models/wan2.2_fun_camera_high_noise_14B_fp8_scaled.safetensors +0 -3
misc/diffusion_models/wan2.2_fun_camera_low_noise_14B_fp8_scaled.safetensors +0 -3
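
The commit message attributes part of the cleanup to duplicates. A minimal sketch of how the listing above could be checked for that: parse each `path +added -removed` entry and report basenames that appear under more than one directory. The helper names (`parse_entry`, `shared_basenames`) are hypothetical, and the sample is a hand-picked subset of the listing. A shared basename is only a hint; matching `oid` values in the LFS pointers are what would confirm identical content.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Hand-picked subset of the "Files changed" listing (illustration only).
LISTING = """\
audio/MelBandRoformer_fp16.safetensors +0 -3
diffusion_models/Wan2_1-InfiniteTalk_Multi_Q8.gguf +0 -3
loras/Wan21_PusaV1_LoRA_14B_rank512_bf16.safetensors +0 -3
misc/diffusion_models/MelBandRoformer_fp16.safetensors +0 -3
misc/diffusion_models/Wan2_1-InfiniteTalk_Multi_Q8.gguf +0 -3
"""

# One listing entry: a path without spaces, then "+<added> -<removed>".
ENTRY = re.compile(r"^(?P<path>\S+) \+(?P<added>\d+) -(?P<removed>\d+)$")


def parse_entry(line):
    """Split one listing line into (path, added, removed)."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a listing entry: {line!r}")
    return match["path"], int(match["added"]), int(match["removed"])


def shared_basenames(listing):
    """Map each basename seen under several paths to those paths."""
    by_name = defaultdict(list)
    for line in listing.splitlines():
        if line.strip():
            path, _, _ = parse_entry(line)
            by_name[PurePosixPath(path).name].append(path)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}


if __name__ == "__main__":
    for name, paths in sorted(shared_basenames(LISTING).items()):
        print(name, "->", ", ".join(paths))
```

Generic names such as `README.md` or `config.json` would also show up in a full run, so the result needs a manual pass before anything is treated as a true duplicate.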

TTS/DiffRhythm/MuQ-MuLan-large/README.md DELETED

@@ -1,111 +0,0 @@
----
-license: cc-by-nc-4.0
-language:
-- en
-- zh
-pipeline_tag: audio-classification
-tags:
-- music
----
-# MuQ & MuQ-MuLan
-<div>
-  <a href='#'><img alt="Static Badge" src="https://img.shields.io/badge/Python-3.8%2B-blue?logo=python&logoColor=white"></a>
-  <a href='https://arxiv.org/abs/2501.01108'><img alt="Static Badge" src="https://img.shields.io/badge/arXiv-2501.01108-%23b31b1b?logo=arxiv&link=https%3A%2F%2Farxiv.org%2F"></a>
-  <a href='https://huggingface.co/OpenMuQ'><img alt="Static Badge" src="https://img.shields.io/badge/huggingface-OpenMuQ-%23FFD21E?logo=huggingface&link=https%3A%2F%2Fhuggingface.co%2FOpenMuQ"></a>
-  <a href='https://pytorch.org/'><img alt="Static Badge" src="https://img.shields.io/badge/framework-PyTorch-%23EE4C2C?logo=pytorch"></a>
-  <a href='https://pypi.org/project/muq'><img alt="Static Badge" src="https://img.shields.io/badge/pip%20install-muq-green?logo=PyPI&logoColor=white&link=https%3A%2F%2Fpypi.org%2Fproject%2Fmuq"></a>
-</div>
-This is the official repository for the paper *"**MuQ**: Self-Supervised **Mu**sic Representation Learning
- with Mel Residual Vector **Q**uantization"*. For more detailed information, we strongly recommend referring to https://github.com/tencent-ailab/MuQ and the [paper]((https://arxiv.org/abs/2501.01108)).
-In this repo, the following models are released:
-- **MuQ**(see [this link](https://huggingface.co/OpenMuQ/MuQ-large-msd-iter)): A large music foundation model pre-trained via Self-Supervised Learning (SSL), achieving SOTA in various MIR tasks.
-- **MuQ-MuLan**(see [this link](https://huggingface.co/OpenMuQ/MuQ-MuLan-large)): A music-text joint embedding model trained via contrastive learning, supporting both English and Chinese texts.
-## Usage
-To begin with, please use pip to install the official `muq` lib, and ensure that your `python>=3.8`:
-```bash
-pip3 install muq
-```
-Using **MuQ-MuLan** to extract the music and text embeddings and calculate the similarity:
-```python
-import torch, librosa
-from muq import MuQMuLan
-# This will automatically fetch checkpoints from huggingface
-device = 'cuda'
-mulan = MuQMuLan.from_pretrained("OpenMuQ/MuQ-MuLan-large")
-mulan = mulan.to(device).eval()
-# Extract music embeddings
-wav, sr = librosa.load("path/to/music_audio.wav", sr = 24000)
-wavs = torch.tensor(wav).unsqueeze(0).to(device)
-with torch.no_grad():
-    audio_embeds = mulan(wavs = wavs)
-# Extract text embeddings (texts can be in English or Chinese)
-texts = ["classical genres, hopeful mood, piano.", "一首适合海边风景的小提琴曲，节奏欢快"]
-with torch.no_grad():
-    text_embeds = mulan(texts = texts)
-# Calculate dot product similarity
-sim = mulan.calc_similarity(audio_embeds, text_embeds)
-print(sim)
-```
-To extract music audio features using **MuQ**:
-```python
-import torch, librosa
-from muq import MuQ
-device = 'cuda'
-wav, sr = librosa.load("path/to/music_audio.wav", sr = 24000)
-wavs = torch.tensor(wav).unsqueeze(0).to(device)
-# This will automatically fetch the checkpoint from huggingface
-muq = MuQ.from_pretrained("OpenMuQ/MuQ-large-msd-iter")
-muq = muq.to(device).eval()
-with torch.no_grad():
-    output = muq(wavs, output_hidden_states=True)
-print('Total number of layers: ', len(output.hidden_states))
-print('Feature shape: ', output.last_hidden_state.shape)
-```
-## Model Checkpoints
-| Model Name | Parameters | Data | HuggingFace🤗 |
-| ----------- | --- | ---  | ----------- |
-| MuQ    | ~300M  | MSD dataset | [OpenMuQ/MuQ-large-msd-iter](https://huggingface.co/OpenMuQ/MuQ-large-msd-iter)       |
-| MuQ-MuLan  | ~700M | music-text pairs | [OpenMuQ/MuQ-MuLan-large](https://huggingface.co/OpenMuQ/MuQ-MuLan-large)       |
-**Note**: Please note that the open-sourced MuQ was trained on the Million Song Dataset. Due to differences in dataset size, the open-sourced model may not achieve the same level of performance as reported in the paper.
-## License
-The code is released under the MIT license.
-The model weights (MuQ-large-msd-iter, MuQ-MuLan-large) are released under the CC-BY-NC 4.0 license.
-## Citation
-```
-@article{zhu2025muq,
-      title={MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization},
-      author={Haina Zhu and Yizhi Zhou and Hangting Chen and Jianwei Yu and Ziyang Ma and Rongzhi Gu and Yi Luo and Wei Tan and Xie Chen},
-      journal={arXiv preprint arXiv:2501.01108},
-      year={2025}
-}
-```

TTS/DiffRhythm/MuQ-MuLan-large/config.json DELETED

@@ -1,41 +0,0 @@
-{
-  "mulan": {
-    "sr": 24000,
-    "clip_secs": 10,
-    "dim_latent": 512,
-    "decoupled_contrastive_learning": true,
-    "hierarchical_contrastive_loss": false,
-    "hierarchical_contrastive_loss_layers": null,
-    "sigmoid_contrastive_loss": false,
-    "rank_contrast": true
-  },
-  "audio_model": {
-    "name": "OpenMuQ/MuQ-large-msd-iter",
-    "model_dim": 1024,
-    "use_layer_idx": -1
-  },
-  "text_model": {
-    "name": "xlm-roberta-base",
-    "model_dim": null,
-    "use_layer_idx": -1
-  },
-  "audio_transformer": {
-    "dim": 768,
-    "tf_depth": 0,
-    "heads": 8,
-    "dim_head": 64,
-    "attn_dropout": 0,
-    "ff_dropout": 0,
-    "ff_mult": 4
-  },
-  "text_transformer": {
-    "dim": 768,
-    "tf_depth": 8,
-    "max_seq_len": 1024,
-    "dim_head": 64,
-    "heads": 8,
-    "attn_dropout": 0,
-    "ff_dropout": 0,
-    "ff_mult": 4
-  }
-}

TTS/DiffRhythm/MuQ-MuLan-large/pytorch_model.bin DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:d42ae3f7cb9b66759ee0089ddc70e2f28b130c2d8ba621457358272d32dd0444
-size 2653954401
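
The three removed lines above are a Git LFS pointer, not the weights themselves: the `size` field gives the byte length of the real file. A minimal sketch of reading such a pointer, assuming the three-field layout shown in this diff; `LfsPointer` and `parse_lfs_pointer` are hypothetical names, and stripping the leading `-` is only there so text pasted from a diff can be parsed directly.

```python
from typing import NamedTuple

# Pointer text copied from the diff above, including the "-" diff markers.
POINTER = """\
-version https://git-lfs.github.com/spec/v1
-oid sha256:d42ae3f7cb9b66759ee0089ddc70e2f28b130c2d8ba621457358272d32dd0444
-size 2653954401
"""


class LfsPointer(NamedTuple):
    version: str
    oid: str   # hex SHA-256 digest, without the "sha256:" prefix
    size: int  # size of the real file in bytes


def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into an LfsPointer."""
    fields = {}
    for line in text.splitlines():
        line = line.lstrip("-").strip()  # tolerate a diff's leading "-"
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unexpected hash algorithm: {algorithm!r}")
    return LfsPointer(fields["version"], digest, int(fields["size"]))


if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER)
    print(f"{pointer.size / 2**30:.2f} GiB, oid {pointer.oid[:12]}...")
```

Summing `size` over every pointer in a commit is one way to sanity-check a total such as the one quoted in the commit message, although this view shows only 50 of the removed files.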

TTS/DiffRhythm/MuQ-large-msd-iter/README.md DELETED

@@ -1,113 +0,0 @@
----
-license: cc-by-nc-4.0
-language:
-- en
-- zh
-pipeline_tag: audio-classification
-tags:
-- music
----
-# MuQ & MuQ-MuLan
-<div>
-  <a href='#'><img alt="Static Badge" src="https://img.shields.io/badge/Python-3.8%2B-blue?logo=python&logoColor=white"></a>
-  <a href='https://arxiv.org/abs/2501.01108'><img alt="Static Badge" src="https://img.shields.io/badge/arXiv-2501.01108-%23b31b1b?logo=arxiv&link=https%3A%2F%2Farxiv.org%2F"></a>
-  <a href='https://huggingface.co/OpenMuQ'><img alt="Static Badge" src="https://img.shields.io/badge/huggingface-OpenMuQ-%23FFD21E?logo=huggingface&link=https%3A%2F%2Fhuggingface.co%2FOpenMuQ"></a>
-  <a href='https://pytorch.org/'><img alt="Static Badge" src="https://img.shields.io/badge/framework-PyTorch-%23EE4C2C?logo=pytorch"></a>
-  <a href='https://pypi.org/project/muq'><img alt="Static Badge" src="https://img.shields.io/badge/pip%20install-muq-green?logo=PyPI&logoColor=white&link=https%3A%2F%2Fpypi.org%2Fproject%2Fmuq"></a>
-</div>
-This is the official repository for the paper *"**MuQ**: Self-Supervised **Mu**sic Representation Learning
- with Mel Residual Vector **Q**uantization"*. For more detailed information, we strongly recommend referring to https://github.com/tencent-ailab/MuQ and the [paper]((https://arxiv.org/abs/2501.01108)).
-In this repo, the following models are released:
-- **MuQ**(see [this link](https://huggingface.co/OpenMuQ/MuQ-large-msd-iter)): A large music foundation model pre-trained via Self-Supervised Learning (SSL), achieving SOTA in various MIR tasks.
-- **MuQ-MuLan**(see [this link](https://huggingface.co/OpenMuQ/MuQ-MuLan-large)): A music-text joint embedding model trained via contrastive learning, supporting both English and Chinese texts.
-## Usage
-To begin with, please use pip to install the official `muq` lib, and ensure that your `python>=3.8`:
-```bash
-pip3 install muq
-```
-To extract music audio features using **MuQ**:
-```python
-import torch, librosa
-from muq import MuQ
-device = 'cuda'
-wav, sr = librosa.load("path/to/music_audio.wav", sr = 24000)
-wavs = torch.tensor(wav).unsqueeze(0).to(device)
-# This will automatically fetch the checkpoint from huggingface
-muq = MuQ.from_pretrained("OpenMuQ/MuQ-large-msd-iter")
-muq = muq.to(device).eval()
-with torch.no_grad():
-    output = muq(wavs, output_hidden_states=True)
-print('Total number of layers: ', len(output.hidden_states))
-print('Feature shape: ', output.last_hidden_state.shape)
-```
-Using **MuQ-MuLan** to extract the music and text embeddings and calculate the similarity:
-```python
-import torch, librosa
-from muq import MuQMuLan
-# This will automatically fetch checkpoints from huggingface
-device = 'cuda'
-mulan = MuQMuLan.from_pretrained("OpenMuQ/MuQ-MuLan-large")
-mulan = mulan.to(device).eval()
-# Extract music embeddings
-wav, sr = librosa.load("path/to/music_audio.wav", sr = 24000)
-wavs = torch.tensor(wav).unsqueeze(0).to(device)
-with torch.no_grad():
-    audio_embeds = mulan(wavs = wavs)
-# Extract text embeddings (texts can be in English or Chinese)
-texts = ["classical genres, hopeful mood, piano.", "一首适合海边风景的小提琴曲，节奏欢快"]
-with torch.no_grad():
-    text_embeds = mulan(texts = texts)
-# Calculate dot product similarity
-sim = mulan.calc_similarity(audio_embeds, text_embeds)
-print(sim)
-```
-## Model Checkpoints
-| Model Name | Parameters | Data | HuggingFace🤗 |
-| ----------- | --- | ---  | ----------- |
-| MuQ    | ~300M  | MSD dataset | [OpenMuQ/MuQ-large-msd-iter](https://huggingface.co/OpenMuQ/MuQ-large-msd-iter)       |
-| MuQ-MuLan  | ~700M | music-text pairs | [OpenMuQ/MuQ-MuLan-large](https://huggingface.co/OpenMuQ/MuQ-MuLan-large)       |
-**Note**: Please note that the open-sourced MuQ was trained on the Million Song Dataset. Due to differences in dataset size, the open-sourced model may not achieve the same level of performance as reported in the paper.
-## License
-The code is released under the MIT license.
-The model weights (MuQ-large-msd-iter, MuQ-MuLan-large) are released under the CC-BY-NC 4.0 license.
-## Citation
-```
-@article{zhu2025muq,
-      title={MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization},
-      author={Haina Zhu and Yizhi Zhou and Hangting Chen and Jianwei Yu and Ziyang Ma and Rongzhi Gu and Yi Luo and Wei Tan and Xie Chen},
-      journal={arXiv preprint arXiv:2501.01108},
-      year={2025}
-}
-```

TTS/DiffRhythm/MuQ-large-msd-iter/config.json DELETED

@@ -1,143 +0,0 @@
-{
-  "codebook_dim": 16,
-  "codebook_size": 8192,
-  "conv_dim": 512,
-  "encoder_depth": 12,
-  "encoder_dim": 1024,
-  "features": [
-    "melspec_2048"
-  ],
-  "hop_length": 240,
-  "is_flash": false,
-  "label_rate": 25,
-  "mask_hop": 0.4,
-  "mask_prob": 0.6,
-  "n_mels": 128,
-  "num_codebooks": 1,
-  "recon_loss_ratio": null,
-  "resume_checkpoint": null,
-  "rvq_ckpt_path": null,
-  "rvq_multi_layer_num": 1,
-  "rvq_n_codebooks": 8,
-  "stat": {
-    "melspec_2048_cnt": 14282760192,
-    "melspec_2048_mean": 6.768444971712967,
-    "melspec_2048_std": 18.417922652295623
-  },
-  "use_encodec_target": false,
-  "use_rvq_target": true,
-  "use_vq_target": false,
-  "w2v2_config": {
-    "activation_dropout": 0.1,
-    "adapter_kernel_size": 3,
-    "adapter_stride": 2,
-    "add_adapter": false,
-    "apply_spec_augment": true,
-    "architectures": [
-      "Wav2Vec2ConformerForCTC"
-    ],
-    "attention_dropout": 0.1,
-    "bos_token_id": 1,
-    "classifier_proj_size": 256,
-    "codevector_dim": 768,
-    "conformer_conv_dropout": 0.1,
-    "contrastive_logits_temperature": 0.1,
-    "conv_bias": true,
-    "conv_depthwise_kernel_size": 31,
-    "conv_dim": [
-      512,
-      512,
-      512,
-      512,
-      512,
-      512,
-      512
-    ],
-    "conv_kernel": [
-      10,
-      3,
-      3,
-      3,
-      3,
-      2,
-      2
-    ],
-    "conv_stride": [
-      5,
-      2,
-      2,
-      2,
-      2,
-      2,
-      2
-    ],
-    "ctc_loss_reduction": "sum",
-    "ctc_zero_infinity": false,
-    "diversity_loss_weight": 0.1,
-    "do_stable_layer_norm": true,
-    "eos_token_id": 2,
-    "feat_extract_activation": "gelu",
-    "feat_extract_dropout": 0.0,
-    "feat_extract_norm": "layer",
-    "feat_proj_dropout": 0.1,
-    "feat_quantizer_dropout": 0.0,
-    "final_dropout": 0.1,
-    "gradient_checkpointing": false,
-    "hidden_act": "swish",
-    "hidden_dropout": 0.1,
-    "hidden_dropout_prob": 0.1,
-    "hidden_size": 1024,
-    "initializer_range": 0.02,
-    "intermediate_size": 4096,
-    "layer_norm_eps": 1e-05,
-    "layerdrop": 0.0,
-    "mask_feature_length": 10,
-    "mask_feature_min_masks": 0,
-    "mask_feature_prob": 0.0,
-    "mask_time_length": 10,
-    "mask_time_min_masks": 2,
-    "mask_time_prob": 0.05,
-    "max_source_positions": 5000,
-    "model_type": "wav2vec2-conformer",
-    "num_adapter_layers": 3,
-    "num_attention_heads": 16,
-    "num_codevector_groups": 2,
-    "num_codevectors_per_group": 320,
-    "num_conv_pos_embedding_groups": 16,
-    "num_conv_pos_embeddings": 128,
-    "num_feat_extract_layers": 7,
-    "num_hidden_layers": 24,
-    "num_negatives": 100,
-    "output_hidden_size": 1024,
-    "pad_token_id": 0,
-    "position_embeddings_type": "rotary",
-    "proj_codevector_dim": 768,
-    "rotary_embedding_base": 10000,
-    "tdnn_dilation": [
-      1,
-      2,
-      3,
-      1,
-      1
-    ],
-    "tdnn_dim": [
-      512,
-      512,
-      512,
-      512,
-      1500
-    ],
-    "tdnn_kernel": [
-      5,
-      3,
-      3,
-      1,
-      1
-    ],
-    "torch_dtype": "float32",
-    "transformers_version": "4.19.0.dev0",
-    "use_weighted_layer_sum": false,
-    "vocab_size": 32,
-    "xvector_output_dim": 512
-  }
-}

TTS/DiffRhythm/MuQ-large-msd-iter/model.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:273febab2be02872c37d2c37e48a9d6c52c1c9392f3eeeabd498efa281ccb7a6
-size 1333825096

TTS/DiffRhythm/MuQ-large-msd-iter/pytorch_model.bin DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:334df3de2832ec1acfd8b6ce54e7de4073401fe821f7ec0ad0d954832be2d26a
-size 1333965438

TTS/DiffRhythm/cfm_model_v1_2.pt DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:3e819b317ce2cf1fb22f386d74f351b697204ec1f57f03edfe50dbca71cf0768
-size 2218709125

TTS/DiffRhythm/config.json DELETED

@@ -1,13 +0,0 @@
-{
-    "model_type": "diffrhythm",
-    "model": {
-        "dim": 2048,
-        "depth": 16,
-        "heads": 32,
-        "ff_mult": 4,
-        "text_dim": 512,
-        "conv_layers": 4,
-        "mel_dim": 64,
-        "text_num_embeds": 363
-    }
-}

TTS/DiffRhythm/vae_model.pt DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:712693f27299937c6ccf1a6d6f1d9b45c7c8c11210d3b0cbb0f36181465ba29f
-size 624520127
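
Before deleting a large file from the repository, it can be worth confirming that a local or mirrored copy still matches its pointer. In a Git LFS pointer, `oid sha256:` is the SHA-256 of the file content and `size` is its length in bytes, so both can be recomputed. The sketch below assumes a local copy exists; `matches_pointer` is a hypothetical helper, and the demo uses a small temporary file rather than a real checkpoint.

```python
import hashlib
import os
import tempfile


def matches_pointer(path, oid, size, chunk_size=1 << 20):
    """Return True if the file at `path` has the given byte size and SHA-256."""
    # Cheap check first: a size mismatch avoids hashing gigabytes of data.
    if os.path.getsize(path) != size:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so multi-gigabyte checkpoints do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == oid


if __name__ == "__main__":
    payload = b"stand-in for model weights"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    try:
        expected = hashlib.sha256(payload).hexdigest()
        print(matches_pointer(tmp.name, expected, len(payload)))
    finally:
        os.remove(tmp.name)
```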

TTS/DiffRhythm/xlm-roberta-base/README.md DELETED

@@ -1,200 +0,0 @@
----
-tags:
-- exbert
-language:
-- multilingual
-- af
-- am
-- ar
-- as
-- az
-- be
-- bg
-- bn
-- br
-- bs
-- ca
-- cs
-- cy
-- da
-- de
-- el
-- en
-- eo
-- es
-- et
-- eu
-- fa
-- fi
-- fr
-- fy
-- ga
-- gd
-- gl
-- gu
-- ha
-- he
-- hi
-- hr
-- hu
-- hy
-- id
-- is
-- it
-- ja
-- jv
-- ka
-- kk
-- km
-- kn
-- ko
-- ku
-- ky
-- la
-- lo
-- lt
-- lv
-- mg
-- mk
-- ml
-- mn
-- mr
-- ms
-- my
-- ne
-- nl
-- no
-- om
-- or
-- pa
-- pl
-- ps
-- pt
-- ro
-- ru
-- sa
-- sd
-- si
-- sk
-- sl
-- so
-- sq
-- sr
-- su
-- sv
-- sw
-- ta
-- te
-- th
-- tl
-- tr
-- ug
-- uk
-- ur
-- uz
-- vi
-- xh
-- yi
-- zh
-license: mit
----
-# XLM-RoBERTa (base-sized model)
-XLM-RoBERTa model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages. It was introduced in the paper [Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116) by Conneau et al. and first released in [this repository](https://github.com/pytorch/fairseq/tree/master/examples/xlmr).
-Disclaimer: The team releasing XLM-RoBERTa did not write a model card for this model so this model card has been written by the Hugging Face team.
-## Model description
-XLM-RoBERTa is a multilingual version of RoBERTa. It is pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.
-RoBERTa is a transformers model pretrained on a large corpus in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts.
-More precisely, it was pretrained with the Masked language modeling (MLM) objective. Taking a sentence, the model randomly masks 15% of the words in the input then run the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs) that usually see the words one after the other, or from autoregressive models like GPT which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
-This way, the model learns an inner representation of 100 languages that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard classifier using the features produced by the XLM-RoBERTa model as inputs.
-## Intended uses & limitations
-You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?search=xlm-roberta) to look for fine-tuned versions on a task that interests you.
-Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at models like GPT2.
-## Usage
-You can use this model directly with a pipeline for masked language modeling:
-```python
->>> from transformers import pipeline
->>> unmasker = pipeline('fill-mask', model='xlm-roberta-base')
->>> unmasker("Hello I'm a <mask> model.")
-[{'score': 0.10563907772302628,
-  'sequence': "Hello I'm a fashion model.",
-  'token': 54543,
-  'token_str': 'fashion'},
- {'score': 0.08015287667512894,
-  'sequence': "Hello I'm a new model.",
-  'token': 3525,
-  'token_str': 'new'},
- {'score': 0.033413201570510864,
-  'sequence': "Hello I'm a model model.",
-  'token': 3299,
-  'token_str': 'model'},
- {'score': 0.030217764899134636,
-  'sequence': "Hello I'm a French model.",
-  'token': 92265,
-  'token_str': 'French'},
- {'score': 0.026436051353812218,
-  'sequence': "Hello I'm a sexy model.",
-  'token': 17473,
-  'token_str': 'sexy'}]
-```
-Here is how to use this model to get the features of a given text in PyTorch:
-```python
-from transformers import AutoTokenizer, AutoModelForMaskedLM
-tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
-model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
-# prepare input
-text = "Replace me by any text you'd like."
-encoded_input = tokenizer(text, return_tensors='pt')
-# forward pass
-output = model(**encoded_input)
-```
-### BibTeX entry and citation info
-```bibtex
-@article{DBLP:journals/corr/abs-1911-02116,
-  author    = {Alexis Conneau and
-               Kartikay Khandelwal and
-               Naman Goyal and
-               Vishrav Chaudhary and
-               Guillaume Wenzek and
-               Francisco Guzm{\'{a}}n and
-               Edouard Grave and
-               Myle Ott and
-               Luke Zettlemoyer and
-               Veselin Stoyanov},
-  title     = {Unsupervised Cross-lingual Representation Learning at Scale},
-  journal   = {CoRR},
-  volume    = {abs/1911.02116},
-  year      = {2019},
-  url       = {http://arxiv.org/abs/1911.02116},
-  eprinttype = {arXiv},
-  eprint    = {1911.02116},
-  timestamp = {Mon, 11 Nov 2019 18:38:09 +0100},
-  biburl    = {https://dblp.org/rec/journals/corr/abs-1911-02116.bib},
-  bibsource = {dblp computer science bibliography, https://dblp.org}
-}
-```
-<a href="https://huggingface.co/exbert/?model=xlm-roberta-base">
-	<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
-</a>

TTS/DiffRhythm/xlm-roberta-base/config.json DELETED

@@ -1,25 +0,0 @@
-{
-  "architectures": [
-    "XLMRobertaForMaskedLM"
-  ],
-  "attention_probs_dropout_prob": 0.1,
-  "bos_token_id": 0,
-  "eos_token_id": 2,
-  "hidden_act": "gelu",
-  "hidden_dropout_prob": 0.1,
-  "hidden_size": 768,
-  "initializer_range": 0.02,
-  "intermediate_size": 3072,
-  "layer_norm_eps": 1e-05,
-  "max_position_embeddings": 514,
-  "model_type": "xlm-roberta",
-  "num_attention_heads": 12,
-  "num_hidden_layers": 12,
-  "output_past": true,
-  "pad_token_id": 1,
-  "position_embedding_type": "absolute",
-  "transformers_version": "4.17.0.dev0",
-  "type_vocab_size": 1,
-  "use_cache": true,
-  "vocab_size": 250002
-}

TTS/DiffRhythm/xlm-roberta-base/flax_model.msgpack DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:311b6941e02128b01c6a429f55b47b351a86fe53e6802774d87696bcbc465992
-size 1113187999

TTS/DiffRhythm/xlm-roberta-base/model.onnx DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:a76bfe6a405f1a9ace42b2dbd8fbd284dd8127a732ddcf2145b0fc9413b30d40
-size 1881470773

TTS/DiffRhythm/xlm-roberta-base/model.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:6fd4797bc397c3b8b55d6bb5740366b57e6a3ce91c04c77f22aafc0c128e6feb
-size 1115567652

TTS/DiffRhythm/xlm-roberta-base/pytorch_model.bin DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:9d83baaafea92d36de26002c8135a427d55ee6fdc4faaa6e400be4c47724a07e
-size 1115590446

TTS/DiffRhythm/xlm-roberta-base/sentencepiece.bpe.model DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
-size 5069051

TTS/DiffRhythm/xlm-roberta-base/tf_model.h5 DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:d1232fb4018ab3a236c29f10aefd190ef844ad994ac74820d9532637bd87b3f4
-size 1112441536

TTS/DiffRhythm/xlm-roberta-base/tokenizer.json DELETED

The diff for this file is too large to render. See raw diff

TTS/DiffRhythm/xlm-roberta-base/tokenizer_config.json DELETED

@@ -1 +0,0 @@
-{"model_max_length": 512}

ace_step/README.md DELETED

@@ -1,122 +0,0 @@
----
-license: apache-2.0
-tags:
-- music
-- text2music
-- acestep
-pipeline_tag: text-to-audio
-language:
-- en
-- zh
-- de
-- fr
-- es
-- it
-- pt
-- pl
-- tr
-- ru
-- cs
-- nl
-- ar
-- ja
-- hu
-- ko
-- hi
----
-# ACE-Step: A Step Towards Music Generation Foundation Model
-![ACE-Step Framework](https://github.com/ACE-Step/ACE-Step/raw/main/assets/ACE-Step_framework.png)
-## Model Description
-ACE-Step is a novel open-source foundation model for music generation that overcomes key limitations of existing approaches through a holistic architectural design. It integrates diffusion-based generation with Sana's Deep Compression AutoEncoder (DCAE) and a lightweight linear transformer, achieving state-of-the-art performance in generation speed, musical coherence, and controllability.
-**Key Features:**
-- 15× faster than LLM-based baselines (20s for 4-minute music on A100)
-- Superior musical coherence across melody, harmony, and rhythm
-- full-song generation, duration control and accepts natural language descriptions
-## Uses
-### Direct Use
-ACE-Step can be used for:
-- Generating original music from text descriptions
-- Music remixing and style transfer
-- edit song lyrics
-### Downstream Use
-The model serves as a foundation for:
-- Voice cloning applications
-- Specialized music generation (rap, jazz, etc.)
-- Music production tools
-- Creative AI assistants
-### Out-of-Scope Use
-The model should not be used for:
-- Generating copyrighted content without permission
-- Creating harmful or offensive content
-- Misrepresenting AI-generated music as human-created
-## How to Get Started
-see: https://github.com/ace-step/ACE-Step
-## Hardware Performance
-| Device        | 27 Steps | 60 Steps |
-|---------------|----------|----------|
-| NVIDIA A100   | 27.27x   | 12.27x   |
-| RTX 4090      | 34.48x   | 15.63x   |
-| RTX 3090      | 12.76x   | 6.48x    |
-| M2 Max        | 2.27x    | 1.03x    |
-*RTF (Real-Time Factor) shown - higher values indicate faster generation*
-## Limitations
-- Performance varies by language (top 10 languages perform best)
-- Longer generations (>5 minutes) may lose structural coherence
-- Rare instruments may not render perfectly
-- Output Inconsistency: Highly sensitive to random seeds and input duration, leading to varied "gacha-style" results.
-- Style-specific Weaknesses: Underperforms on certain genres (e.g. Chinese rap/zh_rap) Limited style adherence and musicality ceiling
-- Continuity Artifacts: Unnatural transitions in repainting/extend operations
-- Vocal Quality: Coarse vocal synthesis lacking nuance
-- Control Granularity: Needs finer-grained musical parameter control
-## Ethical Considerations
-Users should:
-- Verify originality of generated works
-- Disclose AI involvement
-- Respect cultural elements and copyrights
-- Avoid harmful content generation
-## Model Details
-**Developed by:** ACE Studio and StepFun
-**Model type:** Diffusion-based music generation with transformer conditioning
-**License:** Apache 2.0
-**Resources:**
-- [Project Page](https://ace-step.github.io/)
-- [Demo Space](https://huggingface.co/spaces/ACE-Step/ACE-Step)
-- [GitHub Repository](https://github.com/ACE-Step/ACE-Step)
-## Citation
-```bibtex
-@misc{gong2025acestep,
-  title={ACE-Step: A Step Towards Music Generation Foundation Model},
-  author={Junmin Gong, Wenxiao Zhao, Sen Wang, Shengyuan Xu, Jing Guo},
-  howpublished={\url{https://github.com/ace-step/ACE-Step}},
-  year={2025},
-  note={GitHub repository}
-}
-```
-## Acknowledgements
-This project is co-led by ACE Studio and StepFun.

ace_step/config.json DELETED

@@ -1,35 +0,0 @@
-{
-  "_class_name": "ACEStepTransformer2DModel",
-  "_diffusers_version": "0.32.2",
-  "attention_head_dim": 128,
-  "in_channels": 8,
-  "inner_dim": 2560,
-  "lyric_encoder_vocab_size": 6693,
-  "lyric_hidden_size": 1024,
-  "max_height": 16,
-  "max_position": 32768,
-  "max_width": 32768,
-  "mlp_ratio": 2.5,
-  "num_attention_heads": 20,
-  "num_layers": 24,
-  "out_channels": 8,
-  "patch_size": [
-    16,
-    1
-  ],
-  "rope_theta": 1000000.0,
-  "speaker_embedding_dim": 512,
-  "ssl_encoder_depths": [
-    8,
-    8
-  ],
-  "ssl_latent_dims": [
-    1024,
-    768
-  ],
-  "ssl_names": [
-    "mert",
-    "m-hubert"
-  ],
-  "text_embedding_dim": 768
-}

audio/MelBandRoformer_fp16.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:6119aef379a6c7264e0b37db65ae1e6488b8ca4a00baf56d6d244737b8488226
-size 456479072

diffusion_models/Phantom-Wan-14B_fp8_e4m3fn.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:205c2924aadcd4e1312d6aac0b4cfba80eeea33db99419b113c10eec4810cabc
-size 15001320640

diffusion_models/Wan2_1-I2V-14B-480p_fp8_e4m3fn_scaled_KJ.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:2ff922282cd84589702e6e8c26e083d1160bfc2b217dd44e1ae2688441dc495d
-size 16643349018

diffusion_models/Wan2_1-InfiniteTalk-Multi_fp8_e4m3fn_scaled_KJ.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:4ded4f02f2bf312e7a68f2d75cd0c680a177aef6917c9960a1eddc34f70de26d
-size 2712729090

diffusion_models/Wan2_1-InfiniteTalk-Single_fp8_e4m3fn_scaled_KJ.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:bd6e0e6feab8c22a482b1c4dd7c0504c215c35b507ddc3b4dcaa5d3ef539879e
-size 2713548210

diffusion_models/Wan2_1-InfiniteTalk_Multi_Q8.gguf DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:2b9b1dc2fb0f0a351e688ad8dc7545bf90b2a2f20cd91953ac077510ef6b7bc0
-size 2646330016

diffusion_models/Wan2_1-InfiniteTalk_Single_Q8.gguf DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:c5e251c56174995d940494ec02fdf9d36da00dffdde6827829801cd171fe8ffd
-size 2646330016

diffusion_models/wan2.1-i2v-14b-480p-Q4_K_M.gguf DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:d91f7139acadb42ea05cdf97b311e5099f714f11fbe4d90916500e2f53cbba82
-size 11341184384

loras/FastWan_T2V_14B_480p_lora_rank_128_bf16.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:93fe4efb5198710843de9843091e15a4a967702f62f169135b73be51884fb7d7
-size 1253192432

loras/Wan2.2-Fun-A14B-InP-LOW-HPS2.1_resized_dynamic_avg_rank_15_bf16.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:1879ffd9ee08b533157eb04b6440673515be1ac7b4ee81648355e3bf3a59bdfd
-size 101752852

loras/Wan21_PusaV1_LoRA_14B_rank512_bf16.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:a510b5562e05efa831127bd6a6b3aecf1c4747cffdddcc0b28f88c0667ef1694
-size 4907437824

misc/TTS/ACE-Step-v1-3.5B/ace_step_transformer/diffusion_pytorch_model.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:e810f16728d8a2e0d1b9c3a907aac8c9a427ce38edbd890cb3dce5ff92da5aad
-size 6611422728

misc/TTS/ACE-Step-v1-3.5B/music_dcae_f8c8/diffusion_pytorch_model.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:2b0cb469307ac50659d1880db2a99bae47d0df335cbb36853964662d4b80e8ee
-size 313646516

misc/TTS/ACE-Step-v1-3.5B/music_vocoder/diffusion_pytorch_model.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:c92c9b46e28ab7b37b777780cf4308ad7ddac869636bb77aa61599358c4bc1c0
-size 206350988

misc/TTS/ACE-Step-v1-3.5B/umt5-base/model.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:779cec0d210b2123e21d0a9cd8128f02b4d412627355028965a8be0b241cc3b6
-size 1127460248

misc/ace_step/all_in_one/ace_step_v1_3.5b.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:f07cad74c4adce52ca14ca1bdf74cf3c14cbafb0823b95eca4459467fa369f40
-size 7699743341

misc/clip_vision/clip_vision_h.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:64a7ef761bfccbadbaa3da77366aac4185a6c58fa5de5f589b42a65bcc21f161
-size 1264219396

misc/diffusion_models/MelBandRoformer_fp16.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:6119aef379a6c7264e0b37db65ae1e6488b8ca4a00baf56d6d244737b8488226
-size 456479072

misc/diffusion_models/Wan14BI2VFusioniX_phantom_14B_fp16.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:205c2924aadcd4e1312d6aac0b4cfba80eeea33db99419b113c10eec4810cabc
-size 15001320640

misc/diffusion_models/Wan2_1-Fun-V1_1-14B-Control-Camera_fp8_e4m3fn.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:44fb0cd28b22e5f3fe71ec9604e1e03c83cb6b15cf0353a7f2b77bc316fafcc7
-size 17648319713

misc/diffusion_models/Wan2_1-InfiniteTalk_Multi_Q8.gguf DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:2b9b1dc2fb0f0a351e688ad8dc7545bf90b2a2f20cd91953ac077510ef6b7bc0
-size 2646330016

misc/diffusion_models/Wan2_1-InfiniteTalk_Single_Q8.gguf DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:c5e251c56174995d940494ec02fdf9d36da00dffdde6827829801cd171fe8ffd
-size 2646330016

misc/diffusion_models/Wan2_1-T2V-14B_fp8_e4m3fn_scaled_KJ.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:5519e566e620037b1adb399886143991036d27d44455f41190410967a2fc130d
-size 14526876890

misc/diffusion_models/Wan2_2-I2V-A14B-HIGH_fp8_e4m3fn_scaled_KJ.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:b3a6e732feb5fd5fa35f5e3ef612fa1f0a77dc66601fbf999d4f84a01e7120a6
-size 15002999858

misc/diffusion_models/Wan2_2-I2V-A14B-LOW_fp8_e4m3fn_scaled_KJ.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:3338c9e672ad9e406a28b38231d6c9d94bf63ab73c3940b91428321993491bb8
-size 15002999858

misc/diffusion_models/wan2.1-i2v-14b-480p-Q4_K_M.gguf DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:d91f7139acadb42ea05cdf97b311e5099f714f11fbe4d90916500e2f53cbba82
-size 11341184384

misc/diffusion_models/wan2.2_fun_camera_high_noise_14B_fp8_scaled.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:c14fec6b1f1ee16acf7c6ae2feab8c2b0e909cfad15f6765d959c6dea587e0b4
-size 15535183490

misc/diffusion_models/wan2.2_fun_camera_low_noise_14B_fp8_scaled.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:6251dee756a4b9b26862e63491706aa68cad55999efc8299c102b54785b5f944
-size 15535183490