lexandstuff committed on
Commit 6c6cd43 · verified · 1 Parent(s): 774ea23

Upload README.md with huggingface_hub

  1. README.md +112 -0
README.md ADDED
@@ -0,0 +1,112 @@
+ ---
+ license: mit
+ library_name: mlx
+ tags:
+ - mlx
+ - audio
+ - speech
+ - feature-extraction
+ - contentvec
+ - hubert
+ - voice-conversion
+ - rvc
+ datasets:
+ - librispeech_asr
+ language:
+ - en
+ pipeline_tag: feature-extraction
+ ---
+
+ # MLX ContentVec / HuBERT Base
+
+ MLX-converted weights for the ContentVec/HuBERT base model, optimized for Apple Silicon.
+
+ This model extracts speaker-agnostic semantic features from audio. It is primarily used as the feature-extraction backbone for [RVC (Retrieval-based Voice Conversion)](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI).
+
+ ## Model Details
+
+ - **Architecture**: HuBERT Base (12 transformer layers)
+ - **Parameters**: ~90M
+ - **Input**: 16 kHz mono audio
+ - **Output**: 768-dimensional features (~50 frames/second)
+ - **Framework**: [MLX](https://github.com/ml-explore/mlx)
+ - **Format**: SafeTensors (float32)
+
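The ~50 frames/second figure follows from the total stride of the convolutional front end. As a rough sketch, assuming this port keeps the standard HuBERT Base feature extractor (seven unpadded 1-D convolutions with kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2), the number of output frames for a clip can be estimated as below. `CONV_LAYERS` and `num_frames` are illustrative names and not part of `mlx_contentvec`.

```python
# Assumed standard HuBERT Base conv front end: (kernel_size, stride) per layer.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples: int) -> int:
    """Estimate the number of output frames for `num_samples` of 16 kHz audio."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        if length < kernel:
            return 0  # clip is too short to pass through this layer
        length = (length - kernel) // stride + 1
    return length


# Total stride is 5 * 2**6 = 320 samples, i.e. 20 ms per frame at 16 kHz.
print(num_frames(16000))  # frames for one second of audio
```

Under these assumptions, one second of audio yields 49 frames rather than exactly 50, because the first frame needs a 400-sample receptive field before it can be emitted.
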
+ ## Usage
+
+ ```python
+ import mlx.core as mx
+ import librosa
+ from mlx_contentvec import ContentvecModel
+
+ # Load model
+ model = ContentvecModel(encoder_layers_1=0)
+ model.load_weights("contentvec_base.safetensors")
+ model.eval()
+
+ # Load audio as 16 kHz mono
+ audio, sr = librosa.load("input.wav", sr=16000, mono=True)
+ source = mx.array(audio).reshape(1, -1)  # add batch dimension: (1, num_samples)
+
+ # Extract features
+ result = model(source)
+ features = result["x"]  # Shape: (1, num_frames, 768)
+ ```
+
+ ## Installation
+
+ ```bash
+ pip install git+https://github.com/example/mlx-contentvec.git
+ ```
+
+ ## Download Weights
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ weights_path = hf_hub_download(
+     repo_id="lexandstuff/mlx-contentvec",
+     filename="contentvec_base.safetensors"
+ )
+ ```
+
+ ## Validation
+
+ These weights produce outputs that are **numerically equivalent** to the original PyTorch implementation, up to small floating-point differences:
+
+ | Metric | Value |
+ |--------|-------|
+ | Max absolute difference | 7.3e-6 |
+ | Cosine similarity | 1.000000 |
+
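For readers who want to run a similar check, the sketch below shows one way the two metrics in the table could be computed with NumPy. It assumes both feature tensors have already been converted to NumPy arrays of the same shape and that cosine similarity is taken over the flattened tensors; the README does not say how the reported numbers were obtained, and `compare_features` is a hypothetical helper.

```python
import numpy as np


def compare_features(reference: np.ndarray, candidate: np.ndarray) -> dict:
    """Return max absolute difference and cosine similarity of two feature arrays."""
    ref = np.asarray(reference, dtype=np.float64).ravel()
    cand = np.asarray(candidate, dtype=np.float64).ravel()
    if ref.shape != cand.shape:
        raise ValueError("feature arrays must have the same number of elements")
    max_abs_diff = float(np.max(np.abs(ref - cand)))
    cosine = float(ref @ cand / (np.linalg.norm(ref) * np.linalg.norm(cand)))
    return {"max_abs_diff": max_abs_diff, "cosine_similarity": cosine}


# Stand-in data; in practice pass the PyTorch and MLX outputs for the same clip.
rng = np.random.default_rng(0)
reference = rng.standard_normal((1, 49, 768)).astype(np.float32)
candidate = reference + np.float32(1e-6)
metrics = compare_features(reference, candidate)
print(metrics)
```

Accumulating in float64 keeps the comparison itself from adding rounding error on top of the differences being measured.
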
+ ## Source Weights
+
+ Converted from [hubert_base.pt](https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt) (MD5: `b76f784c1958d4e535cd0f6151ca35e4`).
+
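Anyone repeating the conversion may want to confirm that their copy of the source checkpoint matches the checksum quoted above. A minimal sketch using only the standard library follows; `md5_of_file` and `verify_source_weights` are illustrative helpers, and the local file path is whatever the checkpoint was saved as.

```python
import hashlib
from pathlib import Path

EXPECTED_MD5 = "b76f784c1958d4e535cd0f6151ca35e4"  # checksum quoted in this README


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_source_weights(path) -> bool:
    """Return True if the file at `path` matches the expected MD5."""
    return md5_of_file(path) == EXPECTED_MD5
```

For example, `verify_source_weights("hubert_base.pt")` would be called on a locally downloaded copy; reading in chunks avoids loading the whole checkpoint into memory.
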
+ ## Use Cases
+
+ - **Voice Conversion**: Feature extraction for RVC pipeline
+ - **Speaker Verification**: Content-based audio embeddings
+ - **Speech Analysis**: Semantic feature extraction
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{qian2022contentvec,
+   title={ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers},
+   author={Qian, Kaizhi and Zhang, Yang and Gao, Heting and Ni, Junrui and Lai, Cheng-I and Cox, David and Hasegawa-Johnson, Mark and Chang, Shiyu},
+   booktitle={International Conference on Machine Learning},
+   year={2022}
+ }
+
+ @article{hsu2021hubert,
+   title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
+   author={Hsu, Wei-Ning and others},
+   journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
+   year={2021}
+ }
+ ```
+
+ ## License
+
+ MIT