---
license: mit
library_name: mlx
tags:
- mlx
- audio
- speech
- feature-extraction
- contentvec
- hubert
- voice-conversion
- rvc
datasets:
- librispeech_asr
language:
- en
pipeline_tag: feature-extraction
---

# MLX ContentVec / HuBERT Base

MLX-converted weights for the ContentVec/HuBERT Base model, optimized for Apple Silicon.

This model extracts speaker-agnostic semantic features from audio. It is primarily used as the feature extraction backbone for [RVC (Retrieval-based Voice Conversion)](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI).

## Model Details

- **Architecture**: HuBERT Base (12 transformer layers)
- **Parameters**: ~90M
- **Input**: 16 kHz mono audio
- **Output**: 768-dimensional features (~50 frames/second)
- **Framework**: [MLX](https://github.com/ml-explore/mlx)
- **Format**: SafeTensors (float32)
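
The input and output figures above are linked by the convolutional front end. In the standard HuBERT Base configuration it has seven layers with a combined stride of 320 samples, which at 16 kHz gives one frame every 20 ms. The sketch below works through that arithmetic. The kernel sizes and strides are the published HuBERT Base defaults and are assumed to carry over unchanged to this MLX port; `CONV_LAYERS`, `hop_samples`, and `num_frames` are illustrative names, not part of `mlx_contentvec`.

```python
# (kernel_size, stride) of the 7 convolutional layers in the standard
# HuBERT Base feature extractor. ASSUMPTION: this port uses the same values.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

SAMPLE_RATE = 16_000  # Hz, the input rate the model expects


def hop_samples() -> int:
    """Input samples between consecutive output frames (product of strides)."""
    hop = 1
    for _, stride in CONV_LAYERS:
        hop *= stride
    return hop


def num_frames(num_samples: int) -> int:
    """Output frames for a waveform of `num_samples` samples."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        if length < kernel:
            return 0  # too short to pass through this layer
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


print(hop_samples())                 # 320 samples = 20 ms at 16 kHz
print(num_frames(SAMPLE_RATE))       # 49 frames for 1 second
print(num_frames(10 * SAMPLE_RATE))  # 499 frames for 10 seconds
```

The count comes out slightly under 50 frames per second because the convolutions are unpadded, so the first and last few samples do not get a frame of their own.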

## Usage

```python
import mlx.core as mx
import librosa
from mlx_contentvec import ContentvecModel

# Load the model and its converted weights
model = ContentvecModel(encoder_layers_1=0)
model.load_weights("contentvec_base.safetensors")
model.eval()

# Load audio, resampled to 16 kHz mono
audio, sr = librosa.load("input.wav", sr=16000, mono=True)
# Add a batch dimension: shape (1, num_samples)
source = mx.array(audio).reshape(1, -1)

# Extract features
result = model(source)
features = result["x"]  # Shape: (1, num_frames, 768)
```

## Installation

```bash
pip install git+https://github.com/example/mlx-contentvec.git
```

## Download Weights

```python
from huggingface_hub import hf_hub_download

weights_path = hf_hub_download(
    repo_id="lexandstuff/mlx-contentvec",
    filename="contentvec_base.safetensors"
)
```

## Validation

These weights produce outputs that are **numerically equivalent** to the original PyTorch implementation, within float32 rounding error:

| Metric | Value |
|--------|-------|
| Max absolute difference | 7.3e-6 |
| Cosine similarity | 1.000000 |
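
The two metrics in the table can be computed as in the minimal sketch below. It is not the script that produced the numbers above: it assumes both feature tensors have already been flattened to plain Python lists of floats, and the function names and toy values are illustrative only.

```python
import math
from typing import Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = math.fsum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(math.fsum(x * x for x in a))
    norm_b = math.sqrt(math.fsum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for flattened reference (PyTorch) and converted (MLX) features.
reference = [0.25, -1.5, 3.0, 0.125]
converted = [0.25, -1.5, 3.0, 0.125 + 4e-6]

print(f"max abs diff: {max_abs_diff(reference, converted):.1e}")
print(f"cosine similarity: {cosine_similarity(reference, converted):.6f}")
```

Reporting both values is useful because they catch different failures: the maximum absolute difference flags a single badly converted element, while cosine similarity summarizes agreement across the whole tensor.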

## Source Weights

Converted from [hubert_base.pt](https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt) (MD5: `b76f784c1958d4e535cd0f6151ca35e4`).
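
If you download the source checkpoint yourself, you can compare it against the MD5 above before converting. This is a minimal sketch using only the standard library. The helper names and the local path `hubert_base.pt` are assumptions for illustration.

```python
import hashlib

# MD5 of the source checkpoint, as listed above.
EXPECTED_MD5 = "b76f784c1958d4e535cd0f6151ca35e4"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path: str, expected: str = EXPECTED_MD5) -> bool:
    """True if the file at `path` has the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()


# Example (hypothetical local path):
# if not verify_checkpoint("hubert_base.pt"):
#     raise SystemExit("checksum mismatch: re-download the checkpoint")
```

MD5 is adequate here for detecting a corrupted or truncated download. It is not a defense against deliberate tampering.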

## Use Cases

- **Voice Conversion**: Feature extraction for RVC pipeline
- **Speaker Verification**: Content-based audio embeddings
- **Speech Analysis**: Semantic feature extraction

## Citation

```bibtex
@inproceedings{qian2022contentvec,
  title={ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers},
  author={Qian, Kaizhi and Zhang, Yang and Gao, Heting and Ni, Junrui and Lai, Cheng-I and Cox, David and Hasegawa-Johnson, Mark and Chang, Shiyu},
  booktitle={International Conference on Machine Learning},
  year={2022}
}

@article{hsu2021hubert,
  title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
  author={Hsu, Wei-Ning and others},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2021}
}
```

## License

MIT