# X-Codec (general audio)

This codec is part of the X-Codec family of codecs, summarized in the table below:

| Model checkpoint | Semantic Model | Domain | Training Data |
|------------------|----------------|--------|---------------|
| [xcodec-hubert-librispeech](https://huggingface.co/hf-audio/xcodec-hubert-librispeech) | [facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960) | Speech | Librispeech |
| [xcodec-wavlm-mls](https://huggingface.co/hf-audio/xcodec-wavlm-mls) | [microsoft/wavlm-base-plus](https://huggingface.co/microsoft/wavlm-base-plus) | Speech | MLS English |
| [xcodec-wavlm-more-data](https://huggingface.co/hf-audio/xcodec-wavlm-more-data) | [microsoft/wavlm-base-plus](https://huggingface.co/microsoft/wavlm-base-plus) | Speech | MLS English + internal data |
| [xcodec-hubert-general](https://huggingface.co/hf-audio/xcodec-hubert-general) | [ZhenYe234/hubert_base_general_audio](https://huggingface.co/ZhenYe234/hubert_base_general_audio) | General audio | 200k hours of internal data |
| [xcodec-hubert-general-balanced](https://huggingface.co/hf-audio/xcodec-hubert-general-balanced) (this model) | [ZhenYe234/hubert_base_general_audio](https://huggingface.co/ZhenYe234/hubert_base_general_audio) | General audio | More balanced data |

The original model is `xcodec_hubert_general_audio_more_data` from [this table](https://github.com/zhenye234/xcodec?tab=readme-ov-file#available-models).

## Example usage

The example below applies the codec at every available bandwidth.

```python
import os

import torch
from datasets import Audio, load_dataset
from scipy.io.wavfile import write as write_wav
from transformers import AutoFeatureExtractor, XcodecModel

model_id = "hf-audio/xcodec-hubert-general-balanced"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
available_bandwidths = [0.5, 1, 1.5, 2, 4]  # in kbps

# load model and feature extractor
model = XcodecModel.from_pretrained(model_id, device_map=torch_device)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

# load an audio example and resample it to the codec's sampling rate
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
librispeech_dummy = librispeech_dummy.cast_column(
    "audio", Audio(sampling_rate=feature_extractor.sampling_rate)
)
audio_array = librispeech_dummy[0]["audio"]["array"]
inputs = feature_extractor(
    raw_audio=audio_array, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt"
).to(model.device)
audio = inputs["input_values"]

for bandwidth in available_bandwidths:
    print(f"Encoding with bandwidth: {bandwidth} kbps")

    # encode
    audio_codes = model.encode(audio, bandwidth=bandwidth, return_dict=False)
    print("Codebook shape", audio_codes.shape)
    # 0.5 kbps -> torch.Size([1, 1, 293])
    # 1.0 kbps -> torch.Size([1, 2, 293])
    # 1.5 kbps -> torch.Size([1, 3, 293])
    # 2.0 kbps -> torch.Size([1, 4, 293])
    # 4.0 kbps -> torch.Size([1, 8, 293])

    # decode
    input_values_dec = model.decode(audio_codes).audio_values

    # save reconstructed audio to file
    write_wav(
        f"{os.path.basename(model_id)}_{bandwidth}.wav",
        feature_extractor.sampling_rate,
        input_values_dec.squeeze().detach().cpu().numpy(),
    )

write_wav("original.wav", feature_extractor.sampling_rate, audio.squeeze().detach().cpu().numpy())
```
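
The codebook shapes printed above follow a simple pattern: each additional 0.5 kbps of bandwidth adds one quantizer. As a sanity check, here is a sketch of that arithmetic under two assumptions not stated in this card (a 320-sample hop at 16 kHz, implied by 113,920 padded samples mapping to 356 frames in the batch example below, and 1024-entry codebooks, i.e. 10 bits per code):

```python
import math

sampling_rate = 16_000  # assumed codec sampling rate
hop_length = 320        # assumed: 113920 padded samples -> 356 frames
codebook_size = 1024    # assumed: 10 bits per code

frame_rate = sampling_rate / hop_length   # 50 frames per second
bits_per_code = math.log2(codebook_size)  # 10 bits

def num_codebooks(bandwidth_kbps: float) -> int:
    # each quantizer costs frame_rate * bits_per_code = 500 bps = 0.5 kbps
    return round(bandwidth_kbps * 1000 / (frame_rate * bits_per_code))

print([num_codebooks(bw) for bw in [0.5, 1, 1.5, 2, 4]])
# [1, 2, 3, 4, 8]
```

These counts match the second dimension of the codebook shapes in the example above.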

### 🔊 Audio Samples

**Original**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/original.wav" type="audio/wav">
</audio>

**0.5 kbps**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-general-balanced_0.5.wav" type="audio/wav">
</audio>

**1 kbps**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-general-balanced_1.wav" type="audio/wav">
</audio>

**1.5 kbps**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-general-balanced_1.5.wav" type="audio/wav">
</audio>

**2 kbps**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-general-balanced_2.wav" type="audio/wav">
</audio>

**4 kbps**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-general-balanced_4.wav" type="audio/wav">
</audio>

## Batch example

The codec can also process several audio samples at once; the feature extractor pads them to a common length.

```python
import torch
from datasets import Audio, load_dataset
from transformers import AutoFeatureExtractor, XcodecModel

model_id = "hf-audio/xcodec-hubert-general-balanced"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
bandwidth = 4  # kbps
n_audio = 2  # number of audio samples to process in a batch

# load model and feature extractor
model = XcodecModel.from_pretrained(model_id, device_map=torch_device)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

# load audio examples and resample them to the codec's sampling rate
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column(
    "audio", Audio(sampling_rate=feature_extractor.sampling_rate)
)
audio = [audio_sample["array"] for audio_sample in ds[-n_audio:]["audio"]]
print(f"Input audio shape: {[_sample.shape for _sample in audio]}")
# Input audio shape: [(113840,), (71680,)]
inputs = feature_extractor(
    raw_audio=audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt"
).to(model.device)
audio = inputs["input_values"]
print(f"Padded audio shape: {audio.shape}")
# Padded audio shape: torch.Size([2, 1, 113920])

# encode
audio_codes = model.encode(audio, bandwidth=bandwidth, return_dict=False)
print("Codebook shape", audio_codes.shape)
# Codebook shape torch.Size([2, 8, 356])

# decode
decoded_audio = model.decode(audio_codes).audio_values
print("Decoded audio shape", decoded_audio.shape)
# Decoded audio shape torch.Size([2, 1, 113920])
```
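
Since the batch is padded to a common length, the decoded waveforms carry trailing padding. A minimal post-processing sketch, using NumPy stand-ins for the tensors above (the lengths are the ones printed in the batch example; slicing real `decoded_audio` tensors works the same way):

```python
import numpy as np

# stand-in for decoded_audio.squeeze(1).cpu().numpy() from the batch example
decoded = np.zeros((2, 113920))
original_lengths = [113840, 71680]  # the per-sample input lengths printed above

# trim each decoded waveform back to its original length
trimmed = [wav[:length] for wav, length in zip(decoded, original_lengths)]
print([wav.shape for wav in trimmed])
# [(113840,), (71680,)]
```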