juan4pro12 committed
Commit 7d38920 · verified · 1 Parent(s): c005e9b

Add custom endpoint handler

Files changed (3)
  1. README.md +16 -545
  2. handler.py +25 -0
  3. requirements.txt +5 -0
README.md CHANGED
@@ -1,545 +1,16 @@
- ---
- language:
- - en
- - zh
- - es
- - pt
- - de
- - ja
- - ko
- - fr
- - ru
- - id
- - sv
- - it
- - he
- - nl
- - pl
- - 'no'
- - tr
- - th
- - ar
- - hu
- - ca
- - cs
- - da
- - fa
- - af
- - hi
- - fi
- - et
- - aa
- - el
- - ro
- - vi
- - bg
- - is
- - sl
- - sk
- - lt
- - sw
- - uk
- - kl
- - lv
- - hr
- - ne
- - sr
- - tl
- - yi
- - ms
- - ur
- - mn
- - hy
- - jv
- license: mit
- pipeline_tag: audio-text-to-text
- tags:
- - ASR
- - Diarization
- - Speech-to-Text
- - Transcription
- library_name: transformers
- ---
-
-
- ## VibeVoice-ASR (Transformers-compatible version)
- [![GitHub](https://img.shields.io/badge/GitHub-Repo-black?logo=github)](https://github.com/microsoft/VibeVoice)
- [![Deploy on Foundry](https://img.shields.io/badge/Deploy-on_Foundry-blue?logo=microsoft)](https://huggingface.co/docs/microsoft-azure/foundry/guides/vibevoice-asr)
- [![Live Playground](https://img.shields.io/badge/Live-Playground-green?logo=gradio)](https://aka.ms/vibevoice-asr)
- [![Technical Report](https://img.shields.io/badge/arXiv-2601.18184-b31b1b?logo=arxiv)](https://arxiv.org/pdf/2601.18184)
-
- **VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords** and over **50 languages**.
-
- ➡️ **Demo:** [VibeVoice-ASR-Demo](https://aka.ms/vibevoice-asr)<br>
- ➡️ **Report:** [VibeVoice-ASR Technical Report](https://arxiv.org/pdf/2601.18184)<br>
-
- <p align="left">
- <img src="figures/VibeVoice_ASR_archi.png" alt="VibeVoice-ASR Architecture" height="250px">
- </p>
-
-
- ## 🔥 Key Features
-
-
- - **🕒 60-minute Single-Pass Processing**:
- Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice ASR accepts up to **60 minutes** of continuous audio input within 64K token length. This ensures consistent speaker tracking and semantic coherence across the entire hour.
-
- - **👤 Customized Hotwords**:
- Users can provide customized hotwords (e.g., specific names, technical terms, or background info) to guide the recognition process, significantly improving accuracy on domain-specific content.
-
- - **📝 Rich Transcription (Who, When, What)**:
- The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates *who* said *what* and *when*.
-
- - **🌍 Multilingual & Code-Switching Support**:
- It supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Language distribution can be found [here](#language-distribution).
-
-
- ## Usage
-
- ### Setup
-
- VibeVoice-ASR is available as of v5.3.0 of Transformers!
-
- ```
- pip install "transformers>=5.3.0"
- ```
-
- ### Loading model
-
- ```python
- from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
-
- model_id = "microsoft/VibeVoice-ASR-HF"
- processor = AutoProcessor.from_pretrained(model_id)
- model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id)
- ```
-
- ### Speaker-timestamped transcription
-
- A notable feature of VibeVoice ASR is its ability to transcribe multi-speaker content, denoting who spoke and when.
-
- The example below transcribes the following audio.
-
- <audio controls>
- <source src="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav" type="audio/wav">
- </audio>
-
- ```python
- from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
-
- model_id = "microsoft/VibeVoice-ASR-HF"
- processor = AutoProcessor.from_pretrained(model_id)
- model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")
- print(f"Model loaded on {model.device} with dtype {model.dtype}")
-
- # Prepare inputs using `apply_transcription_request`
- inputs = processor.apply_transcription_request(
-     audio="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
- ).to(model.device, model.dtype)
-
- # Apply model
- output_ids = model.generate(**inputs)
- generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
- transcription = processor.decode(generated_ids)[0]
- print("\n" + "=" * 60)
- print("RAW OUTPUT")
- print("=" * 60)
- print(transcription)
-
- transcription = processor.decode(generated_ids, return_format="parsed")[0]
- print("\n" + "=" * 60)
- print("TRANSCRIPTION (list of dicts)")
- print("=" * 60)
- for speaker_transcription in transcription:
-     print(speaker_transcription)
-
- # Remove speaker labels, only get raw transcription
- transcription = processor.decode(generated_ids, return_format="transcription_only")[0]
- print("\n" + "=" * 60)
- print("TRANSCRIPTION ONLY")
- print("=" * 60)
- print(transcription)
-
- """
- ============================================================
- RAW OUTPUT
- ============================================================
- <|im_start|>assistant
- [{"Start":0,"End":15.43,"Speaker":0,"Content":"Hello everyone and welcome to the Vibe Voice podcast. I'm your host, Alex, and today we're getting into one of the biggest debates in all of sports: who's the greatest basketball player of all time? I'm so excited to have Sam here to talk about it with me."},{"Start":15.43,"End":21.05,"Speaker":1,"Content":"Thanks so much for having me, Alex. And you're absolutely right. This question always brings out some seriously strong feelings."},{"Start":21.05,"End":31.66,"Speaker":0,"Content":"Okay, so let's get right into it. For me, it has to be Michael Jordan. Six trips to the finals, six championships. That kind of perfection is just incredible."},{"Start":31.66,"End":40.93,"Speaker":1,"Content":"Oh man, the first thing that always pops into my head is that shot against the Cleveland Cavaliers back in '89. Jordan just rises, hangs in the air forever, and just sinks it."}]<|im_end|>
- <|endoftext|>
-
- ============================================================
- TRANSCRIPTION (list of dicts)
- ============================================================
- {'Start': 0, 'End': 15.43, 'Speaker': 0, 'Content': "Hello everyone and welcome to the Vibe Voice podcast. I'm your host, Alex, and today we're getting into one of the biggest debates in all of sports: who's the greatest basketball player of all time? I'm so excited to have Sam here to talk about it with me."}
- {'Start': 15.43, 'End': 21.05, 'Speaker': 1, 'Content': "Thanks so much for having me, Alex. And you're absolutely right. This question always brings out some seriously strong feelings."}
- {'Start': 21.05, 'End': 31.66, 'Speaker': 0, 'Content': "Okay, so let's get right into it. For me, it has to be Michael Jordan. Six trips to the finals, six championships. That kind of perfection is just incredible."}
- {'Start': 31.66, 'End': 40.93, 'Speaker': 1, 'Content': "Oh man, the first thing that always pops into my head is that shot against the Cleveland Cavaliers back in '89. Jordan just rises, hangs in the air forever, and just sinks it."}
-
- ============================================================
- TRANSCRIPTION ONLY
- ============================================================
- Hello everyone and welcome to the Vibe Voice podcast. I'm your host, Alex, and today we're getting into one of the biggest debates in all of sports: who's the greatest basketball player of all time? I'm so excited to have Sam here to talk about it with me. Thanks so much for having me, Alex. And you're absolutely right. This question always brings out some seriously strong feelings. Okay, so let's get right into it. For me, it has to be Michael Jordan. Six trips to the finals, six championships. That kind of perfection is just incredible. Oh man, the first thing that always pops into my head is that shot against the Cleveland Cavaliers back in '89. Jordan just rises, hangs in the air forever, and just sinks it.
- """
- ```
-
- The VibeVoice ASR model is trained to generate a string that resembles a JSON structure. The flag `return_format="parsed"` tries to return the generated output as a list of dicts, while `return_format="transcription_only"` tries to extract only the transcribed audio. If they fail, the generated output is returned as-is.
-
- ### Providing context
-
- It is also possible to provide context. This can be useful if certain words cannot be transcribed correctly, such as proper nouns.
-
- Below we transcribe an audio where the speaker (with a German accent) talks about VibeVoice, comparing with and without the context "About VibeVoice".
-
- <audio controls>
- <source src="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav" type="audio/wav">
- </audio>
-
- ```python
- from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
-
- model_id = "microsoft/VibeVoice-ASR-HF"
- processor = AutoProcessor.from_pretrained(model_id)
- model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")
- print(f"Model loaded on {model.device} with dtype {model.dtype}")
-
- # Without context
- inputs = processor.apply_transcription_request(
-     audio="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
- ).to(model.device, model.dtype)
- output_ids = model.generate(**inputs)
- generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
- transcription = processor.decode(generated_ids, return_format="transcription_only")[0]
- print(f"WITHOUT CONTEXT: {transcription}")
-
- # With context
- inputs = processor.apply_transcription_request(
-     audio="https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
-     prompt="About VibeVoice",
- ).to(model.device, model.dtype)
- output_ids = model.generate(**inputs)
- generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
- transcription = processor.decode(generated_ids, return_format="transcription_only")[0]
- print(f"WITH CONTEXT : {transcription}")
-
- """
- WITHOUT CONTEXT: Revevoices is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio.
- WITH CONTEXT : VibeVoice is this novel framework designed for generating expressive, long-form, multi-speaker, conversational audio.
- """
- ```
-
-
- ### Batch inference
-
- Batch inference is possible by passing a list of audio and (if provided) a list of prompts of equal length.
-
- ```python
- from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
-
- model_id = "microsoft/VibeVoice-ASR-HF"
- audio = [
-     "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
-     "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav"
- ]
- prompts = ["About VibeVoice", None]
-
- processor = AutoProcessor.from_pretrained(model_id)
- model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")
- print(f"Model loaded on {model.device} with dtype {model.dtype}")
-
- inputs = processor.apply_transcription_request(audio, prompt=prompts).to(model.device, model.dtype)
- output_ids = model.generate(**inputs)
- generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
- transcription = processor.decode(generated_ids, return_format="transcription_only")
-
- print(transcription)
- ```
-
- ### Adjusting tokenizer chunk (e.g. if out-of-memory)
-
- A key feature of VibeVoice ASR is that it can transcribe up to 60 minutes of continuous audio. This is done by chunking audio into 60-second segments (1440000 samples at 24kHz) and caching the convolution states between each segment.
-
- However, if chunks of 60 seconds are too large for your device, the `tokenizer_chunk_size` argument passed to `generate` can be adjusted. *Note it should be a multiple of the hop length (3200 for the original acoustic tokenizer).*
-
- ```python
- from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
-
- tokenizer_chunk_size = 64000 # default is 1440000 (60s @ 24kHz)
- model_id = "microsoft/VibeVoice-ASR-HF"
- audio = [
-     "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
-     "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav"
- ]
- prompts = ["About VibeVoice", None]
-
- processor = AutoProcessor.from_pretrained(model_id)
- model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")
- print(f"Model loaded on {model.device} with dtype {model.dtype}")
-
- inputs = processor.apply_transcription_request(audio, prompt=prompts).to(model.device, model.dtype)
- output_ids = model.generate(**inputs, tokenizer_chunk_size=tokenizer_chunk_size)
- generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
- transcription = processor.decode(generated_ids, return_format="transcription_only")
- print(transcription)
- ```
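Review note: the multiple-of-hop-length constraint described in this removed section can be checked with a little arithmetic. The sketch below rounds a target chunk duration down to the nearest valid size. The helper name `chunk_size_for_seconds` is hypothetical; the 24 kHz sampling rate and the hop length of 3200 are the values quoted in the text.

```python
SAMPLE_RATE = 24_000  # Hz, as quoted in the README text
HOP_LENGTH = 3_200    # samples, hop length of the original acoustic tokenizer


def chunk_size_for_seconds(seconds: float,
                           sample_rate: int = SAMPLE_RATE,
                           hop_length: int = HOP_LENGTH) -> int:
    """Return the largest multiple of hop_length that fits in `seconds` of audio."""
    samples = int(seconds * sample_rate)
    chunk = (samples // hop_length) * hop_length  # round down to a whole number of hops
    if chunk < hop_length:
        raise ValueError("duration is shorter than one hop")
    return chunk


# 60 s at 24 kHz reproduces the documented default of 1440000 samples.
default_chunk = chunk_size_for_seconds(60)
```

The resulting integer is what would be passed as `tokenizer_chunk_size` to `generate`, as in the README example above.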
-
- ### Chat template
-
- VibeVoice ASR also accepts chat template inputs (`apply_transcription_request` is actually a wrapper for `apply_chat_template` for convenience):
- ```python
- from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
-
- model_id = "microsoft/VibeVoice-ASR-HF"
- processor = AutoProcessor.from_pretrained(model_id)
- model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")
-
- chat_template = [
-     [
-         {
-             "role": "user",
-             "content": [
-                 {"type": "text", "text": "About VibeVoice"},
-                 {
-                     "type": "audio",
-                     "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
-                 },
-             ],
-         }
-     ],
-     [
-         {
-             "role": "user",
-             "content": [
-                 {
-                     "type": "audio",
-                     "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
-                 },
-             ],
-         }
-     ],
- ]
-
- inputs = processor.apply_chat_template(
-     chat_template,
-     tokenize=True,
-     return_dict=True,
- ).to(model.device, model.dtype)
-
- output_ids = model.generate(**inputs)
- generated_ids = output_ids[:, inputs["input_ids"].shape[1] :]
- transcription = processor.decode(generated_ids, return_format="transcription_only")
- print(transcription)
- ```
-
- ### Training
-
- VibeVoice ASR can be trained with the loss outputted by the model.
-
- ```python
- from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
-
- model_id = "microsoft/VibeVoice-ASR-HF"
- processor = AutoProcessor.from_pretrained(model_id)
- model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id, device_map="auto")
- model.train()
-
- # Prepare batch of 2
- # -- NOTE: the original model is trained to output transcription, speaker ID, and timestamps in JSON-like format. Below we are only using the transcription text as the label
- chat_template = [
-     [
-         {
-             "role": "user",
-             "content": [
-                 {"type": "text", "text": "VibeVoice is this novel framework designed for generating expressive, long-form, multi-speaker, conversational audio."},
-                 {
-                     "type": "audio",
-                     "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
-                 },
-             ],
-         }
-     ],
-     [
-         {
-             "role": "user",
-             "content": [
-                 {"type": "text", "text": "Hello everyone and welcome to the VibeVoice podcast. I'm your host, Alex, and today we're getting into one of the biggest debates in all of sports: who's the greatest basketball player of all time? I'm so excited to have Sam here to talk about it with me. Thanks so much for having me, Alex. And you're absolutely right. This question always brings out some seriously strong feelings. Okay, so let's get right into it. For me, it has to be Michael Jordan. Six trips to the finals, six championships. That kind of perfection is just incredible. Oh man, the first thing that always pops into my head is that shot against the Cleveland Cavaliers back in '89. Jordan just rises, hangs in the air forever, and just sinks it."},
-                 {
-                     "type": "audio",
-                     "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/example_output/VibeVoice-1.5B_output.wav",
-                 },
-             ],
-         }
-     ],
- ]
- inputs = processor.apply_chat_template(
-     chat_template,
-     tokenize=True,
-     return_dict=True,
-     output_labels=True,
- ).to(model.device, model.dtype)
-
- loss = model(**inputs).loss
- print("Loss:", loss.item())
- loss.backward()
- ```
-
- ### Torch compile
-
- The model can be compiled for faster inference/training.
- ```python
- import time
- import torch
- from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
-
- model_id = "microsoft/VibeVoice-ASR-HF"
-
- num_warmup = 5
- num_runs = 20
-
- # Load processor + model
- processor = AutoProcessor.from_pretrained(model_id)
- model = VibeVoiceAsrForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16,).to("cuda")
-
- # Prepare static inputs
- chat_template = [
-     [
-         {
-             "role": "user",
-             "content": [
-                 {
-                     "type": "text",
-                     "text": "VibeVoice is this novel framework designed for generating expressive, long-form, multi-speaker, conversational audio.",
-                 },
-                 {
-                     "type": "audio",
-                     "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
-                 },
-             ],
-         }
-     ],
- ] * 4  # batch size 4
- inputs = processor.apply_chat_template(
-     chat_template,
-     tokenize=True,
-     return_dict=True,
- ).to("cuda", torch.bfloat16)
-
- # Benchmark without compile
- print("Warming up without compile...")
- with torch.no_grad():
-     for _ in range(num_warmup):
-         _ = model(**inputs)
-
- torch.cuda.synchronize()
-
- print("\nBenchmarking without torch.compile...")
- torch.cuda.synchronize()
- start = time.time()
- with torch.no_grad():
-     for _ in range(num_runs):
-         _ = model(**inputs)
- torch.cuda.synchronize()
- no_compile_time = (time.time() - start) / num_runs
- print(f"Average time without compile: {no_compile_time:.4f}s")
-
- # Benchmark with compile
- print("\nCompiling model...")
- model = torch.compile(model)
-
- print("Warming up with compile (includes graph capture)...")
- with torch.no_grad():
-     for _ in range(num_warmup):
-         _ = model(**inputs)
-
- torch.cuda.synchronize()
-
- print("\nBenchmarking with torch.compile...")
- torch.cuda.synchronize()
- start = time.time()
- with torch.no_grad():
-     for _ in range(num_runs):
-         _ = model(**inputs)
- torch.cuda.synchronize()
- compile_time = (time.time() - start) / num_runs
- print(f"Average time with compile: {compile_time:.4f}s")
-
- speedup = no_compile_time / compile_time
- print(f"\nSpeedup: {speedup:.2f}x")
- ```
-
- ### Pipeline usage
-
- The model can be used as a pipeline, but you will have to define your own methods for parsing the raw output.
-
- ```python
- from transformers import pipeline
-
- model_id = "microsoft/VibeVoice-ASR-HF"
- pipe = pipeline("any-to-any", model=model_id, device_map="auto")
- chat_template = [
-     {
-         "role": "user",
-         "content": [
-             {"type": "text", "text": "About VibeVoice"},
-             {
-                 "type": "audio",
-                 "path": "https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav",
-             },
-         ],
-     }
- ]
- outputs = pipe(text=chat_template, return_full_text=False)
-
- print("\n" + "=" * 60)
- print("RAW PIPELINE OUTPUT")
- print("=" * 60)
- print(outputs)
-
- """
- ============================================================
- RAW PIPELINE OUTPUT
- ============================================================
- [{'input_text': [{'role': 'user', 'content': [{'type': 'text', 'text': 'About VibeVoice'}, {'type': 'audio', 'path': 'https://huggingface.co/datasets/bezzam/vibevoice_samples/resolve/main/realtime_model/vibevoice_tts_german.wav'}]}], 'generated_text': 'assistant\n[{"Start":0.0,"End":7.56,"Speaker":0,"Content":"VibeVoice is this novel framework designed for generating expressive, long-form, multi-speaker conversational audio."}]\n'}]
- """
- ```
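Review note: since the pipeline path leaves parsing to the caller, a small parser for the raw `generated_text` might look like the sketch below. It assumes the output contains a single JSON array with the `Start`/`End`/`Speaker`/`Content` keys shown in the sample output, and it falls back to the raw string when parsing fails. The function names `parse_generated_text` and `transcription_only` are hypothetical, and this is not the library's own implementation of `return_format`.

```python
import json


def parse_generated_text(generated_text: str):
    """Try to extract the JSON array of segments; return the raw text on failure."""
    start = generated_text.find("[")
    end = generated_text.rfind("]")
    if start == -1 or end <= start:
        return generated_text  # no array-like span found
    try:
        segments = json.loads(generated_text[start:end + 1])
    except json.JSONDecodeError:
        return generated_text  # JSON-like but not valid JSON
    return segments if isinstance(segments, list) else generated_text


def transcription_only(segments) -> str:
    """Join the `Content` field of each parsed segment with spaces."""
    return " ".join(segment["Content"] for segment in segments)


raw = 'assistant\n[{"Start":0.0,"End":7.56,"Speaker":0,"Content":"VibeVoice is this novel framework."}]\n'
parsed = parse_generated_text(raw)
```

Here `parsed` is a list of dicts that can be passed to `transcription_only`; a string return value signals that the fallback was taken.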
-
- ## Evaluation
-
- Below are results from the [technical report](https://arxiv.org/pdf/2601.18184).
-
- <p align="center">
- <img src="figures/DER.jpg" alt="DER" width="70%">
- <img src="figures/cpWER.jpg" alt="cpWER" width="70%">
- <img src="figures/tcpWER.jpg" alt="tcpWER" width="70%">
- </p>
-
-
- ### Open ASR Leaderboard
-
- On the [Open ASR leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard), the following results were obtained:
-
- | Dataset | WER (%) |
- | ---------------------- | -------- |
- | ami_test | 17.20 |
- | earnings22_test | 13.17 |
- | gigaspeech_test | 9.67 |
- | librispeech_test.clean | 2.20 |
- | librispeech_test.other | 5.51 |
- | spgispeech_test | 3.80 |
- | tedlium_test | 2.57 |
- | voxpopuli_test | 8.01 |
- | **Average** | **7.77** |
- | **RTFx** | **51.80** |
-
- ## Language Distribution
- <p align="center">
- <img src="figures/language_distribution_horizontal.png" alt="Language Distribution" width="80%">
- </p>
-
-
- ## License
- This project is licensed under the MIT License.
-
- ## Contact
- This project was conducted by members of Microsoft Research. We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected/offensive behavior in our technology, please contact us at VibeVoice@microsoft.com.
- If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.
+ # vibevoice-custom-handler
+
+ Custom handler for deploying `microsoft/VibeVoice-ASR-HF` on a dedicated Hugging Face Inference Endpoint.
+
+ ## Files
+
+ - `handler.py`: custom handler for the endpoint.
+ - `requirements.txt`: additional dependencies.
+ - `deploy_endpoint.py`: reference script for deploying the dedicated endpoint.
+
+ ## Expected configuration
+
+ - Target repo on HF: `juan4pro12/vibevoice-custom-handler`
+ - Dedicated endpoint protected with a token
+ - Hardware: `nvidia-t4` / `small`
+ - Task: `custom`
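Review note: because the endpoint is token-protected, a client has to send a bearer token with each call. The sketch below only builds the request object and does not send it. The helper name `build_transcription_request`, the example URL, and the choice to send raw WAV bytes with an `audio/wav` content type are all assumptions for illustration; the commit does not specify the request format.

```python
import urllib.request


def build_transcription_request(endpoint_url: str, token: str,
                                audio_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying audio and a bearer token."""
    return urllib.request.Request(
        endpoint_url,
        data=audio_bytes,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "audio/wav",  # assumed payload type
        },
        method="POST",
    )


# Hypothetical endpoint URL and token, for illustration only.
request = build_transcription_request(
    "https://example.invalid/vibevoice", "hf_xxx", b"RIFF....WAVE"
)
```

Sending would be a matter of passing `request` to `urllib.request.urlopen` and reading the JSON body, which the handler in this commit shapes as `{"text": ...}`.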
handler.py ADDED
@@ -0,0 +1,25 @@
+ import torch
+ from transformers import AutoProcessor, VibeVoiceAsrForConditionalGeneration
+
+
+ class EndpointHandler:
+     def __init__(self, path: str = ""):
+         self.processor = AutoProcessor.from_pretrained(path)
+         self.model = VibeVoiceAsrForConditionalGeneration.from_pretrained(
+             path,
+             torch_dtype=torch.float16,
+             device_map="auto",
+         )
+
+     def __call__(self, data):
+         inputs_data = data.pop("inputs", data)
+         inputs = self.processor(audio=inputs_data, return_tensors="pt").to(
+             self.model.device,
+             self.model.dtype,
+         )
+
+         with torch.no_grad():
+             generated_ids = self.model.generate(**inputs)
+
+         transcription = self.processor.batch_decode(generated_ids, skip_special_tokens=True)
+         return {"text": transcription[0]}
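Review note: the handler forwards `data["inputs"]` to the processor unchanged. Depending on how the endpoint is called, that value might arrive as raw bytes or as a base64-encoded string; this is an assumption about the request shape, not something the commit states. A minimal sketch of normalising the payload before further decoding, using a hypothetical helper `extract_audio_bytes`:

```python
import base64
import binascii


def extract_audio_bytes(data: dict) -> bytes:
    """Return the audio payload as bytes, accepting raw bytes or a base64 string."""
    payload = data.get("inputs", data)  # same fallback as the handler, without mutating `data`
    if isinstance(payload, (bytes, bytearray)):
        return bytes(payload)
    if isinstance(payload, str):
        try:
            return base64.b64decode(payload, validate=True)
        except (binascii.Error, ValueError) as exc:
            raise ValueError("'inputs' is a string but not valid base64") from exc
    raise TypeError(f"unsupported 'inputs' type: {type(payload).__name__}")


audio = extract_audio_bytes({"inputs": base64.b64encode(b"RIFF").decode("ascii")})
```

Separately, the usage examples in the removed README slice off the prompt tokens (`output_ids[:, inputs["input_ids"].shape[1] :]`) before decoding, whereas this handler decodes the full `generate` output; it may be worth checking whether the returned text includes the prompt.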
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ torch
+ transformers>=5.3.0
+ accelerate
+ soundfile
+ librosa