MLX ContentVec / HuBERT Base
MLX-converted weights for the ContentVec/HuBERT Base model, optimized for Apple Silicon.
The model extracts speaker-agnostic semantic features from audio and is primarily used as the feature-extraction backbone for RVC (Retrieval-based Voice Conversion).
Model Details
- Architecture: HuBERT Base (12 transformer layers)
- Parameters: ~90M
- Input: 16kHz mono audio
- Output: 768-dimensional features (~50 frames/second)
- Framework: MLX
- Format: SafeTensors (float32)
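The "~50 frames/second" figure follows from the convolutional front end. As a rough sketch, assuming this port keeps the standard HuBERT Base feature extractor (seven unpadded 1-D convolutions with the kernel sizes and strides listed below, giving a total stride of 320 samples, or 20 ms at 16kHz), the number of output frames for a given input length can be estimated as follows. The `CONV_LAYERS` constant and `num_frames` helper are illustrative names and are not part of `mlx_contentvec`.

```python
# Assumed (kernel_size, stride) pairs of the standard HuBERT Base
# convolutional feature extractor. Not read from this repository.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples: int) -> int:
    """Estimate the number of feature frames for an unpadded waveform."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        if length < kernel:
            return 0  # input too short to produce any frame
        length = (length - kernel) // stride + 1
    return length


print(num_frames(16000))   # 1 second of 16kHz audio -> 49
print(num_frames(160000))  # 10 seconds of 16kHz audio -> 499
```

Under these assumptions, one second of audio yields 49 frames rather than exactly 50, because the unpadded convolutions trim the edges of the signal.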
Usage
```python
import mlx.core as mx
import librosa
from mlx_contentvec import ContentvecModel

# Load model
model = ContentvecModel(encoder_layers_1=0)
model.load_weights("contentvec_base.safetensors")
model.eval()

# Load audio at 16kHz
audio, sr = librosa.load("input.wav", sr=16000, mono=True)
source = mx.array(audio).reshape(1, -1)

# Extract features
result = model(source)
features = result["x"]  # Shape: (1, num_frames, 768)
```
Installation
```bash
pip install git+https://github.com/example/mlx-contentvec.git
```
Download Weights
```python
from huggingface_hub import hf_hub_download

weights_path = hf_hub_download(
    repo_id="lexandstuff/mlx-contentvec",
    filename="contentvec_base.safetensors",
)
```
Validation
These weights reproduce the outputs of the original PyTorch implementation to within float32 rounding error:
| Metric | Value |
|---|---|
| Max absolute difference | 7.3e-6 |
| Cosine similarity | 1.000000 |
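A comparison of this kind can be sketched with NumPy. This README does not say how the metrics were computed, so the sketch assumes both feature tensors are flattened and compared globally. `compare_features` is a hypothetical helper, and the arrays in the demo are synthetic stand-ins rather than real model outputs.

```python
import numpy as np


def compare_features(reference, candidate):
    """Return (max absolute difference, cosine similarity) of two arrays.

    Both inputs are flattened first; this is an assumption, not
    necessarily the procedure used to produce the table above.
    """
    ref = np.asarray(reference, dtype=np.float64).ravel()
    cand = np.asarray(candidate, dtype=np.float64).ravel()
    if ref.shape != cand.shape:
        raise ValueError("feature arrays must have the same number of elements")
    max_abs_diff = float(np.max(np.abs(ref - cand)))
    cosine = float(np.dot(ref, cand) / (np.linalg.norm(ref) * np.linalg.norm(cand)))
    return max_abs_diff, cosine


# Synthetic demo: a random "reference" and a slightly perturbed copy.
rng = np.random.default_rng(0)
reference = rng.standard_normal((1, 49, 768))
candidate = reference + 1e-6 * rng.standard_normal(reference.shape)
max_abs_diff, cosine = compare_features(reference, candidate)
print(f"max abs diff: {max_abs_diff:.1e}, cosine similarity: {cosine:.6f}")
```

To compare real outputs, convert the MLX and PyTorch feature tensors to NumPy arrays and pass them in place of the synthetic ones.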
Source Weights
Converted from hubert_base.pt (MD5: b76f784c1958d4e535cd0f6151ca35e4).
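To confirm that a local copy of the source checkpoint is the one referenced above, its MD5 digest can be checked with the standard library. This is a minimal sketch: `file_md5` and `matches_source_checkpoint` are illustrative helper names, and the file path is whatever location you downloaded `hubert_base.pt` to.

```python
import hashlib

# MD5 of the source checkpoint, as stated in this README.
EXPECTED_MD5 = "b76f784c1958d4e535cd0f6151ca35e4"


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_source_checkpoint(path):
    """Return True if the file at `path` has the expected MD5 digest."""
    return file_md5(path) == EXPECTED_MD5
```

For example, `matches_source_checkpoint("hubert_base.pt")` returns `True` only when the local file is byte-for-byte identical to the checkpoint used for the conversion. MD5 is adequate here as an integrity check against corrupted or mismatched downloads, not as a security guarantee.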
Use Cases
- Voice Conversion: Feature extraction for RVC pipeline
- Speaker Verification: Content-based audio embeddings
- Speech Analysis: Semantic feature extraction
Citation
```bibtex
@inproceedings{qian2022contentvec,
  title={ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers},
  author={Qian, Kaizhi and Zhang, Yang and Gao, Heting and Ni, Junrui and Lai, Cheng-I and Cox, David and Hasegawa-Johnson, Mark and Chang, Shiyu},
  booktitle={International Conference on Machine Learning},
  year={2022}
}

@article{hsu2021hubert,
  title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
  author={Hsu, Wei-Ning and others},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2021}
}
```
License
MIT