MLX ContentVec / HuBERT Base

MLX-converted weights for ContentVec/HuBERT base model, optimized for Apple Silicon.

This model extracts speaker-agnostic semantic features from audio, primarily used as the feature extraction backbone for RVC (Retrieval-based Voice Conversion).

Model Details

  • Architecture: HuBERT Base (12 transformer layers)
  • Parameters: ~90M
  • Input: 16kHz mono audio
  • Output: 768-dimensional features (~50 frames/second)
  • Framework: MLX
  • Format: SafeTensors (float32)

Usage

import mlx.core as mx
import librosa
from mlx_contentvec import ContentvecModel

# Load model
model = ContentvecModel(encoder_layers_1=0)
model.load_weights("contentvec_base.safetensors")
model.eval()

# Load audio at 16kHz
audio, sr = librosa.load("input.wav", sr=16000, mono=True)
source = mx.array(audio).reshape(1, -1)

# Extract features
result = model(source)
features = result["x"]  # Shape: (1, num_frames, 768)

Installation

pip install git+https://github.com/example/mlx-contentvec.git

Download Weights

from huggingface_hub import hf_hub_download

weights_path = hf_hub_download(
    repo_id="lexandstuff/mlx-contentvec",
    filename="contentvec_base.safetensors"
)

Validation

These weights produce numerically identical outputs to the original PyTorch implementation:

Metric Value
Max absolute difference 7.3e-6
Cosine similarity 1.000000

Source Weights

Converted from hubert_base.pt (MD5: b76f784c1958d4e535cd0f6151ca35e4).

Use Cases

  • Voice Conversion: Feature extraction for RVC pipeline
  • Speaker Verification: Content-based audio embeddings
  • Speech Analysis: Semantic feature extraction

Citation

@inproceedings{qian2022contentvec,
  title={ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers},
  author={Qian, Kaizhi and Zhang, Yang and Gao, Heting and Ni, Junrui and Lai, Cheng-I and Cox, David and Hasegawa-Johnson, Mark and Chang, Shiyu},
  booktitle={International Conference on Machine Learning},
  year={2022}
}

@article{hsu2021hubert,
  title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
  author={Hsu, Wei-Ning and others},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2021}
}

License

MIT

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Dataset used to train lexandstuff/mlx-contentvec