Upload folder using huggingface_hub

Files changed (5)

README.md +113 -0
v2/config.json +71 -0
v2/f0G32k.safetensors +3 -0
v2/f0G40k.safetensors +3 -0
v2/f0G48k.safetensors +3 -0

README.md ADDED

	@@ -0,0 +1,113 @@

+---
+license: mit
+library_name: mlx
+tags:
+  - mlx
+  - voice-conversion
+  - rvc
+  - apple-silicon
+  - audio
+  - speech
+---
+# RVC-MLX Pretrained Weights
+MLX-compatible pretrained weights for [RVC (Retrieval-based Voice Conversion)](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI), converted for use with [rvc-mlx](https://github.com/lucasnewman/rvc-mlx).
+These weights enable high-quality voice conversion on Apple Silicon Macs using the MLX framework.
+## Available Models
+| File | Sample Rate | Size | Description |
+|------|-------------|------|-------------|
+| `v2/f0G48k.safetensors` | 48 kHz | 110 MiB | V2 with F0 (pitch) - highest quality |
+| `v2/f0G40k.safetensors` | 40 kHz | 105 MiB | V2 with F0 (pitch) |
+| `v2/f0G32k.safetensors` | 32 kHz | 107 MiB | V2 with F0 (pitch) |
+All models use:
+- **Architecture**: SynthesizerTrnMs768NSFsid
+- **Input**: 768-dim ContentVec features
+- **F0 Support**: Yes (pitch-aware synthesis)
+## Quick Start
+```python
+from huggingface_hub import hf_hub_download
+# Download the 48kHz model
+weights_path = hf_hub_download(
+    repo_id="lexandstuff/rvc-mlx-weights",
+    filename="v2/f0G48k.safetensors"
+)
+# Download config
+config_path = hf_hub_download(
+    repo_id="lexandstuff/rvc-mlx-weights",
+    filename="v2/config.json"
+)
+```
+## Usage with rvc-mlx
+```python
+import json
+from safetensors.numpy import load_file
+from rvc_mlx.models import SynthesizerTrnMs768NSFsid
+# Load config (config_path and weights_path come from the Quick Start snippet above)
+with open(config_path) as f:
+    configs = json.load(f)
+config = configs["48000"]  # or "40000", "32000"; must match the weights file
+# Create model
+model = SynthesizerTrnMs768NSFsid(**config)
+# Load weights
+weights = load_file(weights_path)
+# ... load weights into model
+```
+## Model Details
+These are **inference-only** weights - training components (posterior encoder) have been removed to reduce file size.
+### Architecture
+```
+SynthesizerTrnMs768NSFsid
+├── enc_p (TextEncoder)      - Encodes ContentVec + pitch
+├── flow (ResidualCoupling)  - Normalizing flow for voice conversion
+├── dec (GeneratorNSF)       - HiFi-GAN vocoder with neural source filter
+└── emb_g (Embedding)        - Speaker embedding
+```
+### Upsampling Rates
+| Sample Rate | Upsample Rates | Total Factor |
+|-------------|----------------|--------------|
+| 32 kHz | [10, 8, 2, 2] | 320x |
+| 40 kHz | [10, 10, 2, 2] | 400x |
+| 48 kHz | [12, 10, 2, 2] | 480x |
+## Original Source
+These weights are converted from the official RVC pretrained models:
+- **Source**: [lj1995/VoiceConversionWebUI](https://huggingface.co/lj1995/VoiceConversionWebUI)
+- **Files**: `pretrained_v2/f0G{32k,40k,48k}.pth`
+## License
+MIT License - same as the original RVC project.
+## Citation
+If you use these weights, please cite the original RVC project:
+```bibtex
+@software{rvc2023,
+  author = {RVC-Project},
+  title = {Retrieval-based-Voice-Conversion-WebUI},
+  year = {2023},
+  url = {https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI}
+}
+```
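
Note on the README's `# ... load weights into model` step: before mapping tensors onto model parameters, it can help to list the tensor names and shapes stored in a weights file. The sketch below reads only the JSON header of a `.safetensors` file (an 8-byte little-endian length followed by that many bytes of JSON) using the standard library. The helper names and the demo tensor name `emb_g.weight` are illustrative assumptions, not part of rvc-mlx, and the demo file is synthetic so the sketch runs without downloading anything.

```python
import json
import struct
import tempfile
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional string map, not a tensor entry.
    header.pop("__metadata__", None)
    return header


def write_demo_file(path):
    """Write a tiny synthetic file (hypothetical tensor name, for illustration only)."""
    header = {"emb_g.weight": {"dtype": "F32", "shape": [1, 2], "data_offsets": [0, 8]}}
    raw = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(raw)))
        f.write(raw)
        f.write(struct.pack("<2f", 0.0, 1.0))  # 8 bytes of float32 tensor data


demo_path = Path(tempfile.mkdtemp()) / "demo.safetensors"
write_demo_file(demo_path)
tensors = read_safetensors_header(demo_path)
for name, info in tensors.items():
    print(name, info["dtype"], info["shape"])
```

Pointing `read_safetensors_header` at the `weights_path` returned by `hf_hub_download` should list the real tensor names, which can then be compared against the module names in the architecture tree (`enc_p`, `flow`, `dec`, `emb_g`).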

v2/config.json ADDED

	@@ -0,0 +1,71 @@

+{
+  "32000": {
+    "model_type": "SynthesizerTrnMs768NSFsid",
+    "version": "v2",
+    "sample_rate": 32000,
+    "f0": true,
+    "spec_channels": 1025,
+    "segment_size": 32,
+    "inter_channels": 192,
+    "hidden_channels": 192,
+    "filter_channels": 768,
+    "n_heads": 2,
+    "n_layers": 6,
+    "kernel_size": 3,
+    "p_dropout": 0,
+    "resblock": "1",
+    "resblock_kernel_sizes": [3, 7, 11],
+    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
+    "upsample_rates": [10, 8, 2, 2],
+    "upsample_initial_channel": 512,
+    "upsample_kernel_sizes": [20, 16, 4, 4],
+    "spk_embed_dim": 109,
+    "gin_channels": 256
+  },
+  "40000": {
+    "model_type": "SynthesizerTrnMs768NSFsid",
+    "version": "v2",
+    "sample_rate": 40000,
+    "f0": true,
+    "spec_channels": 1025,
+    "segment_size": 32,
+    "inter_channels": 192,
+    "hidden_channels": 192,
+    "filter_channels": 768,
+    "n_heads": 2,
+    "n_layers": 6,
+    "kernel_size": 3,
+    "p_dropout": 0,
+    "resblock": "1",
+    "resblock_kernel_sizes": [3, 7, 11],
+    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
+    "upsample_rates": [10, 10, 2, 2],
+    "upsample_initial_channel": 512,
+    "upsample_kernel_sizes": [20, 20, 4, 4],
+    "spk_embed_dim": 109,
+    "gin_channels": 256
+  },
+  "48000": {
+    "model_type": "SynthesizerTrnMs768NSFsid",
+    "version": "v2",
+    "sample_rate": 48000,
+    "f0": true,
+    "spec_channels": 1025,
+    "segment_size": 32,
+    "inter_channels": 192,
+    "hidden_channels": 192,
+    "filter_channels": 768,
+    "n_heads": 2,
+    "n_layers": 6,
+    "kernel_size": 3,
+    "p_dropout": 0,
+    "resblock": "1",
+    "resblock_kernel_sizes": [3, 7, 11],
+    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
+    "upsample_rates": [12, 10, 2, 2],
+    "upsample_initial_channel": 512,
+    "upsample_kernel_sizes": [24, 20, 4, 4],
+    "spk_embed_dim": 109,
+    "gin_channels": 256
+  }
+}
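
Note on `v2/config.json`: the three entries differ only in `sample_rate`, `upsample_rates`, and `upsample_kernel_sizes`, so a quick consistency check is cheap. The sketch below copies those fields from the file and recomputes the "Total Factor" column of the README table. It also checks two patterns that hold in this file but are observations from the numbers, not documented rules: each kernel size is twice its upsample rate, and `sample_rate` divided by the total factor is 100 for every entry (one decoder input frame per 10 ms). The helper name `check_config` is hypothetical.

```python
from math import prod

# Fields copied from v2/config.json; all other keys are identical across entries.
CONFIGS = {
    "32000": {
        "sample_rate": 32000,
        "upsample_rates": [10, 8, 2, 2],
        "upsample_kernel_sizes": [20, 16, 4, 4],
    },
    "40000": {
        "sample_rate": 40000,
        "upsample_rates": [10, 10, 2, 2],
        "upsample_kernel_sizes": [20, 20, 4, 4],
    },
    "48000": {
        "sample_rate": 48000,
        "upsample_rates": [12, 10, 2, 2],
        "upsample_kernel_sizes": [24, 20, 4, 4],
    },
}


def check_config(cfg):
    """Return (total upsample factor, kernels are 2x rates, frames per second)."""
    factor = prod(cfg["upsample_rates"])
    # Observed pattern in this file, not a documented requirement.
    kernels_ok = all(
        k == 2 * r
        for k, r in zip(cfg["upsample_kernel_sizes"], cfg["upsample_rates"])
    )
    frames_per_second = cfg["sample_rate"] / factor
    return factor, kernels_ok, frames_per_second


results = {key: check_config(cfg) for key, cfg in CONFIGS.items()}
for key, (factor, kernels_ok, fps) in results.items():
    print(f"{key}: factor={factor}x kernels_ok={kernels_ok} frames/s={fps}")
```

A check like this is mainly useful when selecting an entry by key: loading `f0G40k.safetensors` with the `"48000"` entry would build a decoder with a different first upsampling layer than the one the weights were saved from.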

v2/f0G32k.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4579a923bb57bad01ea9225a5fb6b641cd85c99c230e901e0d50705cfc3e6f05
+size 112277704

v2/f0G40k.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ccc4dda2fbbe8ec4ad0b614aa004934bfe54087b09fdcdd5cc86c082594a41fb
+size 110196928

v2/f0G48k.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c51f93b025a70a7cea32432de1518d1842fec4336772066cc7be2c981189ba24
+size 114915560
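
Note on the three `.safetensors` entries: the diff shows Git LFS pointer files rather than the weights themselves. Each pointer records the SHA-256 of the real file (`oid`) and its size in bytes. The sketch below parses the pointers shown above and converts the byte counts to mebibytes; the rounded results (107, 105, 110) line up with the sizes listed in the README table. The function names are illustrative, and `sha256_of_file` is included only as a way one might check a downloaded file against its `oid`.

```python
import hashlib

# Pointer contents copied verbatim from the diff above.
POINTERS = {
    "v2/f0G32k.safetensors": (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:4579a923bb57bad01ea9225a5fb6b641cd85c99c230e901e0d50705cfc3e6f05\n"
        "size 112277704\n"
    ),
    "v2/f0G40k.safetensors": (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:ccc4dda2fbbe8ec4ad0b614aa004934bfe54087b09fdcdd5cc86c082594a41fb\n"
        "size 110196928\n"
    ),
    "v2/f0G48k.safetensors": (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:c51f93b025a70a7cea32432de1518d1842fec4336772066cc7be2c981189ba24\n"
        "size 114915560\n"
    ),
}


def parse_lfs_pointer(text):
    """Split a Git LFS pointer into its version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files are not read into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


parsed = {name: parse_lfs_pointer(text) for name, text in POINTERS.items()}
size_mib = {name: p["size"] / 2**20 for name, p in parsed.items()}
for name in parsed:
    print(f"{name}: {parsed[name]['size']} bytes, {size_mib[name]:.1f} MiB")
```

After downloading a file, comparing `sha256_of_file(weights_path)` with the matching `digest` is one way to confirm the download is intact.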