Feature Extraction · Transformers · Safetensors · xcodec

Commit e17193c (verified) · committed by bezzam (HF Staff) · 1 parent: 1ee9a72

Update README.md

Files changed (1): README.md (+129 −1)
README.md CHANGED

@@ -9,4 +9,132 @@ datasets:

This codec is intended for speech data.

Original model is `xcodec_hubert_librispeech` from [this table](https://github.com/zhenye234/xcodec?tab=readme-ov-file#available-models) (previously `xcodec_wavlm_more_data`).

## Example usage

The example below applies the codec at every supported bandwidth and saves the reconstructed audio to file.

```python
from datasets import Audio, load_dataset
from transformers import XcodecModel, AutoFeatureExtractor
import torch
import os
from scipy.io.wavfile import write as write_wav

model_id = "hf-audio/xcodec-hubert-librispeech"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
available_bandwidths = [0.5, 1, 1.5, 2, 4]

# load model and feature extractor
model = XcodecModel.from_pretrained(model_id, device_map=torch_device)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

# load an audio example and resample it to the codec's sampling rate
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
librispeech_dummy = librispeech_dummy.cast_column(
    "audio", Audio(sampling_rate=feature_extractor.sampling_rate)
)
audio_array = librispeech_dummy[0]["audio"]["array"]
inputs = feature_extractor(
    raw_audio=audio_array, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt"
).to(model.device)
audio = inputs["input_values"]

for bandwidth in available_bandwidths:
    print(f"Encoding with bandwidth: {bandwidth} kbps")

    # encode
    audio_codes = model.encode(audio, bandwidth=bandwidth, return_dict=False)
    print("Codebook shape", audio_codes.shape)
    # 0.5 kbps -> torch.Size([1, 1, 293])
    # 1.0 kbps -> torch.Size([1, 2, 293])
    # 1.5 kbps -> torch.Size([1, 3, 293])
    # 2.0 kbps -> torch.Size([1, 4, 293])
    # 4.0 kbps -> torch.Size([1, 8, 293])

    # decode
    input_values_dec = model.decode(audio_codes).audio_values

    # save reconstructed audio to file
    write_wav(
        f"{os.path.basename(model_id)}_{bandwidth}.wav",
        feature_extractor.sampling_rate,
        input_values_dec.squeeze().detach().cpu().numpy(),
    )

write_wav("original.wav", feature_extractor.sampling_rate, audio.squeeze().detach().cpu().numpy())
```
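The codebook shapes printed above follow two simple patterns. These are inferred from the example's outputs, not from a documented API: each 0.5 kbps of bandwidth adds one codebook (quantizer), and each code frame appears to cover 320 input samples (e.g. 113920 padded samples → 356 frames in the batch example). A minimal sketch, no model required:

```python
# Patterns inferred from the printed shapes (assumptions, not documented constants):
#   - one codebook per 0.5 kbps of bandwidth
#   - one code frame per 320 input samples

def num_codebooks(bandwidth_kbps: float) -> int:
    # each 0.5 kbps adds one residual quantizer
    return int(round(bandwidth_kbps / 0.5))

def num_frames(padded_num_samples: int, hop: int = 320) -> int:
    # assumed hop size of 320 samples per code frame
    return padded_num_samples // hop

for bw in [0.5, 1, 1.5, 2, 4]:
    print(f"{bw} kbps -> {num_codebooks(bw)} codebook(s)")
# 0.5 kbps -> 1 codebook(s)
# 1 kbps -> 2 codebook(s)
# 1.5 kbps -> 3 codebook(s)
# 2 kbps -> 4 codebook(s)
# 4 kbps -> 8 codebook(s)

print(num_frames(113920))  # 356, matching the batch example below
```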

### 🔊 Audio Samples

**Original**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/original.wav" type="audio/wav">
</audio>

**0.5 kbps**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-librispeech_0.5.wav" type="audio/wav">
</audio>

**1 kbps**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-librispeech_1.wav" type="audio/wav">
</audio>

**1.5 kbps**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-librispeech_1.5.wav" type="audio/wav">
</audio>

**2 kbps**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-librispeech_2.wav" type="audio/wav">
</audio>

**4 kbps**
<audio controls>
  <source src="https://huggingface.co/datasets/bezzam/xcodec_samples/resolve/main/xcodec-hubert-librispeech_4.wav" type="audio/wav">
</audio>

## Batch example

The example below encodes and decodes a batch of two audio samples at 4 kbps. The feature extractor pads all samples to the length of the longest one in the batch.

```python
from datasets import Audio, load_dataset
from transformers import XcodecModel, AutoFeatureExtractor
import torch

model_id = "hf-audio/xcodec-hubert-librispeech"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
bandwidth = 4
n_audio = 2  # number of audio samples to process in a batch

# load model and feature extractor
model = XcodecModel.from_pretrained(model_id, device_map=torch_device)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)

# load audio examples and resample them to the codec's sampling rate
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column(
    "audio", Audio(sampling_rate=feature_extractor.sampling_rate)
)
audio = [audio_sample["array"] for audio_sample in ds[-n_audio:]["audio"]]
print(f"Input audio shape: {[_sample.shape for _sample in audio]}")
# Input audio shape: [(113840,), (71680,)]
inputs = feature_extractor(
    raw_audio=audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt"
).to(model.device)
audio = inputs["input_values"]
print(f"Padded audio shape: {audio.shape}")
# Padded audio shape: torch.Size([2, 1, 113920])

# encode
audio_codes = model.encode(audio, bandwidth=bandwidth, return_dict=False)
print("Codebook shape", audio_codes.shape)
# Codebook shape torch.Size([2, 8, 356])

# decode
decoded_audio = model.decode(audio_codes).audio_values
print("Decoded audio shape", decoded_audio.shape)
# Decoded audio shape torch.Size([2, 1, 113920])
```
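Because the shorter sample was padded before encoding, the decoded batch still carries that padding. One way to recover the per-sample waveforms is plain tensor slicing back to the original lengths: a minimal sketch, using a zero tensor as a stand-in for the decoded output and the lengths printed by the batch example.

```python
import torch

# stand-in for the batch example's decoded output: [batch, channels, padded_length]
decoded_audio = torch.zeros(2, 1, 113920)

# original (pre-padding) lengths, as printed by the batch example
original_lengths = [113840, 71680]

# slice each decoded waveform back to its original length
trimmed = [wav[..., :length] for wav, length in zip(decoded_audio, original_lengths)]
print([t.shape for t in trimmed])
# [torch.Size([1, 113840]), torch.Size([1, 71680])]
```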