---
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- "no"
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
license: apache-2.0
tags:
- whisper
- speech-recognition
- multilingual
- automatic-speech-recognition
pipeline_tag: automatic-speech-recognition
library_name: transformers
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
---

# MultilingualSTT

OpenAI's Whisper Large V3 model for multilingual speech-to-text transcription.

## Model Description

Whisper Large V3 is a state-of-the-art automatic speech recognition (ASR) model that supports 99+ languages. It provides highly accurate transcription across a wide range of languages and acoustic conditions.

## Key Features

- **99+ Languages**: Supports English, Chinese, German, Spanish, Russian, Korean, French, Japanese, Portuguese, Turkish, Polish, Italian, Hindi, Arabic, and many more
- **Speech Translation**: Can translate speech to English
- **Timestamps**: Supports word-level and sentence-level timestamps
- **Robust**: Excellent handling of accents, background noise, and technical language

## Usage

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use the GPU with half precision when available; otherwise fall back to CPU and float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "Svetozar1993/MultilingualSTT"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# The processor bundles the tokenizer and the audio feature extractor.
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Transcribe a local audio file.
result = pipe("audio.mp3")
print(result["text"])
```
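
## Advanced Usage

The snippets below reuse `pipe` from the Usage section. `sample` stands for any input the pipeline accepts, such as the path to an audio file.

### Specify Language
```python
result = pipe(sample, generate_kwargs={"language": "french"})
```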

### Speech Translation (to English)
```python
result = pipe(sample, generate_kwargs={"task": "translate"})
```
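
The language and task options can be combined in a single `generate_kwargs` dictionary: `language` names the language spoken in the audio, and `task` chooses between transcription and translation into English. The sketch below assumes a hypothetical helper, `build_generate_kwargs`, which is not part of `transformers` and only assembles that dictionary.

```python
from typing import Optional


def build_generate_kwargs(language: Optional[str] = None, translate: bool = False) -> dict:
    """Assemble the generate_kwargs dict for the ASR pipeline (hypothetical helper)."""
    kwargs = {}
    if language is not None:
        # Source language of the audio; when omitted, the model detects it itself.
        kwargs["language"] = language
    # "translate" produces English text; "transcribe" keeps the spoken language.
    kwargs["task"] = "translate" if translate else "transcribe"
    return kwargs


print(build_generate_kwargs("french", translate=True))
# → {'language': 'french', 'task': 'translate'}
```

The returned dictionary would then be passed as `pipe(sample, generate_kwargs=build_generate_kwargs("french", translate=True))`.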

### Get Timestamps
```python
result = pipe(sample, return_timestamps=True)
print(result["chunks"])
```
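
One common use of the timestamped chunks is subtitle generation. The following is a minimal sketch, assuming each entry in `result["chunks"]` is a dictionary with a `text` string and a `timestamp` pair of start and end times in seconds. The helper names `to_srt_time` and `chunks_to_srt` are illustrative, and the example data is hand-written so the code runs without loading the model.

```python
def to_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks: list) -> str:
    """Render timestamped chunks as SRT subtitle text."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        # The end time can be missing on the final chunk; fall back to the start time.
        if end is None:
            end = start
        entries.append(
            f"{index}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


# Hand-written stand-in for result["chunks"].
example_chunks = [
    {"timestamp": (0.0, 2.5), "text": " Hello there."},
    {"timestamp": (2.5, 5.04), "text": " How are you?"},
]
print(chunks_to_srt(example_chunks))
```

With real output, `chunks_to_srt(result["chunks"])` could be written to a `.srt` file.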

### Word-Level Timestamps
```python
result = pipe(sample, return_timestamps="word")
```

## Flash Attention 2

For faster inference on a compatible GPU, install Flash Attention 2:
```bash
pip install flash-attn --no-build-isolation
```

Then pass `attn_implementation="flash_attention_2"` when loading the model:

```python
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",
)
```
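
Loading with `flash_attention_2` fails when the `flash-attn` package is missing or no CUDA GPU is present, so it can help to choose the attention backend at runtime. The sketch below assumes a hypothetical helper, `pick_attn_implementation`, and a `transformers` version that accepts `"sdpa"` (PyTorch's built-in scaled dot-product attention) as a fallback value.

```python
import importlib.util


def pick_attn_implementation(cuda_available: bool) -> str:
    """Return "flash_attention_2" when it looks usable, otherwise "sdpa"."""
    # Flash Attention 2 needs both a CUDA GPU and the flash-attn package.
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return "flash_attention_2"
    # Fall back to PyTorch's scaled dot-product attention.
    return "sdpa"


print(pick_attn_implementation(cuda_available=False))  # → sdpa
```

The result would be passed to the loading call as `attn_implementation=pick_attn_implementation(torch.cuda.is_available())`.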

## Model Details

- **Architecture:** Whisper Large V3
- **Parameters:** 1.55B
- **Languages:** 99+
- **License:** Apache 2.0
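
The parameter count gives a rough idea of how much memory the weights need under each `torch_dtype` used in the Usage section. This is a back-of-the-envelope sketch that assumes 2 bytes per parameter for `float16` and 4 for `float32`; it covers the weights only and ignores activations, audio buffers, and framework overhead.

```python
PARAMETERS = 1.55e9  # parameter count from the list above

BYTES_PER_PARAMETER = {"float16": 2, "float32": 4}


def weight_memory_gb(num_parameters: float, dtype: str) -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * BYTES_PER_PARAMETER[dtype] / 1e9


for dtype in BYTES_PER_PARAMETER:
    print(f"{dtype}: ~{weight_memory_gb(PARAMETERS, dtype):.1f} GB")
# float16: ~3.1 GB
# float32: ~6.2 GB
```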

## Author

Svetozar1993