mlx-contentvec / README.md
lexandstuff's picture
Upload README.md with huggingface_hub
6c6cd43 verified
---
license: mit
library_name: mlx
tags:
- mlx
- audio
- speech
- feature-extraction
- contentvec
- hubert
- voice-conversion
- rvc
datasets:
- librispeech_asr
language:
- en
pipeline_tag: feature-extraction
---
# MLX ContentVec / HuBERT Base
MLX-converted weights for ContentVec/HuBERT base model, optimized for Apple Silicon.
This model extracts speaker-agnostic semantic features from audio, primarily used as the feature extraction backbone for [RVC (Retrieval-based Voice Conversion)](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI).
## Model Details
- **Architecture**: HuBERT Base (12 transformer layers)
- **Parameters**: ~90M
- **Input**: 16kHz mono audio
- **Output**: 768-dimensional features (~50 frames/second)
- **Framework**: [MLX](https://github.com/ml-explore/mlx)
- **Format**: SafeTensors (float32)
## Usage
```python
import mlx.core as mx
import librosa
from mlx_contentvec import ContentvecModel
# Load model
model = ContentvecModel(encoder_layers_1=0)
model.load_weights("contentvec_base.safetensors")
model.eval()
# Load audio at 16kHz
audio, sr = librosa.load("input.wav", sr=16000, mono=True)
source = mx.array(audio).reshape(1, -1)
# Extract features
result = model(source)
features = result["x"] # Shape: (1, num_frames, 768)
```
## Installation
```bash
pip install git+https://github.com/example/mlx-contentvec.git
```
## Download Weights
```python
from huggingface_hub import hf_hub_download
weights_path = hf_hub_download(
repo_id="lexandstuff/mlx-contentvec",
filename="contentvec_base.safetensors"
)
```
## Validation
These weights produce **numerically identical** outputs to the original PyTorch implementation:
| Metric | Value |
|--------|-------|
| Max absolute difference | 7.3e-6 |
| Cosine similarity | 1.000000 |
## Source Weights
Converted from [hubert_base.pt](https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt) (MD5: `b76f784c1958d4e535cd0f6151ca35e4`).
## Use Cases
- **Voice Conversion**: Feature extraction for RVC pipeline
- **Speaker Verification**: Content-based audio embeddings
- **Speech Analysis**: Semantic feature extraction
## Citation
```bibtex
@inproceedings{qian2022contentvec,
title={ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers},
author={Qian, Kaizhi and Zhang, Yang and Gao, Heting and Ni, Junrui and Lai, Cheng-I and Cox, David and Hasegawa-Johnson, Mark and Chang, Shiyu},
booktitle={International Conference on Machine Learning},
year={2022}
}
@article{hsu2021hubert,
title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
author={Hsu, Wei-Ning and others},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
year={2021}
}
```
## License
MIT