---
license: mit
library_name: mlx
tags:
- mlx
- audio
- speech
- feature-extraction
- contentvec
- hubert
- voice-conversion
- rvc
datasets:
- librispeech_asr
language:
- en
pipeline_tag: feature-extraction
---

# MLX ContentVec / HuBERT Base

MLX-converted weights for the ContentVec/HuBERT base model, optimized for Apple Silicon. The model extracts speaker-agnostic semantic features from audio and is primarily used as the feature-extraction backbone for [RVC (Retrieval-based Voice Conversion)](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI).

## Model Details

- **Architecture**: HuBERT Base (12 transformer layers)
- **Parameters**: ~90M
- **Input**: 16 kHz mono audio
- **Output**: 768-dimensional features (~50 frames/second)
- **Framework**: [MLX](https://github.com/ml-explore/mlx)
- **Format**: SafeTensors (float32)

## Usage

```python
import librosa
import mlx.core as mx

from mlx_contentvec import ContentvecModel

# Load model
model = ContentvecModel(encoder_layers_1=0)
model.load_weights("contentvec_base.safetensors")
model.eval()

# Load audio at 16 kHz mono
audio, sr = librosa.load("input.wav", sr=16000, mono=True)
source = mx.array(audio).reshape(1, -1)

# Extract features
result = model(source)
features = result["x"]  # Shape: (1, num_frames, 768)
```

## Installation

```bash
pip install git+https://github.com/example/mlx-contentvec.git
```

## Download Weights

```python
from huggingface_hub import hf_hub_download

weights_path = hf_hub_download(
    repo_id="lexandstuff/mlx-contentvec",
    filename="contentvec_base.safetensors",
)
```

## Validation

These weights produce numerically equivalent outputs to the original PyTorch implementation, within float32 tolerance:

| Metric | Value |
|--------|-------|
| Max absolute difference | 7.3e-6 |
| Cosine similarity | 1.000000 |

## Source Weights

Converted from [hubert_base.pt](https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt) (MD5: `b76f784c1958d4e535cd0f6151ca35e4`).
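The two validation metrics above can be computed with a small comparison helper like the one below. This is a sketch: `compare_features` is a hypothetical name, and the random arrays merely stand in for feature tensors produced by the MLX port and the reference PyTorch implementation.

```python
import numpy as np

def compare_features(a: np.ndarray, b: np.ndarray) -> tuple[float, float]:
    """Return (max absolute difference, cosine similarity) for two feature arrays."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    max_abs_diff = float(np.max(np.abs(a - b)))
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max_abs_diff, cosine

# Stand-ins for the two backends' outputs: identical up to float32-level noise
rng = np.random.default_rng(0)
ref = rng.standard_normal((100, 768)).astype(np.float32)
mlx_out = ref + rng.standard_normal(ref.shape).astype(np.float32) * 1e-6

diff, cos = compare_features(mlx_out, ref)
print(f"max abs diff: {diff:.1e}, cosine similarity: {cos:.6f}")
```

In practice, `ref` and `mlx_out` would be the `features` arrays extracted from the same input audio by each backend.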
## Use Cases

- **Voice Conversion**: Feature extraction for the RVC pipeline
- **Speaker Verification**: Content-based audio embeddings
- **Speech Analysis**: Semantic feature extraction

## Citation

```bibtex
@inproceedings{qian2022contentvec,
  title={ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers},
  author={Qian, Kaizhi and Zhang, Yang and Gao, Heting and Ni, Junrui and Lai, Cheng-I and Cox, David and Hasegawa-Johnson, Mark and Chang, Shiyu},
  booktitle={International Conference on Machine Learning},
  year={2022}
}

@article{hsu2021hubert,
  title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
  author={Hsu, Wei-Ning and others},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2021}
}
```

## License

MIT
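For the embedding-style use cases listed above, the model's frame-level output is typically reduced to a single clip-level vector. Mean pooling over time is a common baseline (not a method prescribed by this model card); a NumPy sketch, where the `(num_frames, 768)` shape matches the per-frame output with the batch dimension removed:

```python
import numpy as np

def clip_embedding(features: np.ndarray) -> np.ndarray:
    """Mean-pool (num_frames, 768) frame features into one 768-dim clip embedding."""
    pooled = features.mean(axis=0)
    # L2-normalize so embeddings can be compared with cosine similarity
    return pooled / np.linalg.norm(pooled)

# Stand-in for extracted features; ~250 frames is roughly 5 s of audio at 50 fps
frames = np.random.default_rng(1).standard_normal((250, 768)).astype(np.float32)
emb = clip_embedding(frames)
print(emb.shape)  # (768,)
```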