lexandstuff
/

mlx-contentvec

+---
+license: mit
+library_name: mlx
+tags:
+  - mlx
+  - audio
+  - speech
+  - feature-extraction
+  - contentvec
+  - hubert
+  - voice-conversion
+  - rvc
+datasets:
+  - librispeech_asr
+language:
+  - en
+pipeline_tag: feature-extraction
+---
+# MLX ContentVec / HuBERT Base
+MLX-converted weights for ContentVec/HuBERT base model, optimized for Apple Silicon.
+This model extracts speaker-agnostic semantic features from audio, primarily used as the feature extraction backbone for [RVC (Retrieval-based Voice Conversion)](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI).
+## Model Details
+- **Architecture**: HuBERT Base (12 transformer layers)
+- **Parameters**: ~90M
+- **Input**: 16kHz mono audio
+- **Output**: 768-dimensional features (~50 frames/second)
+- **Framework**: [MLX](https://github.com/ml-explore/mlx)
+- **Format**: SafeTensors (float32)
+## Usage
+```python
+import mlx.core as mx
+import librosa
+from mlx_contentvec import ContentvecModel
+# Load model
+model = ContentvecModel(encoder_layers_1=0)
+model.load_weights("contentvec_base.safetensors")
+model.eval()
+# Load audio at 16kHz
+audio, sr = librosa.load("input.wav", sr=16000, mono=True)
+source = mx.array(audio).reshape(1, -1)
+# Extract features
+result = model(source)
+features = result["x"]  # Shape: (1, num_frames, 768)
+```
+## Installation
+```bash
+pip install git+https://github.com/example/mlx-contentvec.git
+```
+## Download Weights
+```python
+from huggingface_hub import hf_hub_download
+weights_path = hf_hub_download(
+    repo_id="lexandstuff/mlx-contentvec",
+    filename="contentvec_base.safetensors"
+)
+```
+## Validation
+These weights produce **numerically identical** outputs to the original PyTorch implementation:
+| Metric | Value |
+|--------|-------|
+| Max absolute difference | 7.3e-6 |
+| Cosine similarity | 1.000000 |
+## Source Weights
+Converted from [hubert_base.pt](https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt) (MD5: `b76f784c1958d4e535cd0f6151ca35e4`).
+## Use Cases
+- **Voice Conversion**: Feature extraction for RVC pipeline
+- **Speaker Verification**: Content-based audio embeddings
+- **Speech Analysis**: Semantic feature extraction
+## Citation
+```bibtex
+@inproceedings{qian2022contentvec,
+  title={ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers},
+  author={Qian, Kaizhi and Zhang, Yang and Gao, Heting and Ni, Junrui and Lai, Cheng-I and Cox, David and Hasegawa-Johnson, Mark and Chang, Shiyu},
+  booktitle={International Conference on Machine Learning},
+  year={2022}
+}
+@article{hsu2021hubert,
+  title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
+  author={Hsu, Wei-Ning and others},
+  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
+  year={2021}
+}
+```
+## License
+MIT