	---
	language:
	- en
	- zh
	- de
	- es
	- ru
	- ko
	- fr
	- ja
	- pt
	- tr
	- pl
	- ca
	- nl
	- ar
	- sv
	- it
	- id
	- hi
	- fi
	- vi
	- he
	- uk
	- el
	- ms
	- cs
	- ro
	- da
	- hu
	- ta
	- "no"
	- th
	- ur
	- hr
	- bg
	- lt
	- la
	- mi
	- ml
	- cy
	- sk
	- te
	- fa
	- lv
	- bn
	- sr
	- az
	- sl
	- kn
	- et
	- mk
	- br
	- eu
	- is
	- hy
	- ne
	- mn
	- bs
	- kk
	- sq
	- sw
	- gl
	- mr
	- pa
	- si
	- km
	- sn
	- yo
	- so
	- af
	- oc
	- ka
	- be
	- tg
	- sd
	- gu
	- am
	- yi
	- lo
	- uz
	- fo
	- ht
	- ps
	- tk
	- nn
	- mt
	- sa
	- lb
	- my
	- bo
	- tl
	- mg
	- as
	- tt
	- haw
	- ln
	- ha
	- ba
	- jw
	- su
	license: apache-2.0
	tags:
	- whisper
	- speech-recognition
	- multilingual
	- automatic-speech-recognition
	pipeline_tag: automatic-speech-recognition
	library_name: transformers
	widget:
	- example_title: Librispeech sample 1
	  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
	- example_title: Librispeech sample 2
	  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
	---

	# MultilingualSTT

	OpenAI's Whisper Large V3 model for multilingual speech-to-text transcription.

	## Model Description

	Whisper Large V3 is OpenAI's automatic speech recognition (ASR) model with support for 99+ languages. It is designed to transcribe speech accurately across a wide range of languages and acoustic conditions.

	## Key Features

	- 99+ Languages: Supports English, Chinese, German, Spanish, Russian, Korean, French, Japanese, Portuguese, Turkish, Polish, Italian, Hindi, Arabic, and many more
	- Speech Translation: Can translate speech to English
	- Timestamps: Supports word-level and sentence-level timestamps
	- Robust: Excellent handling of accents, background noise, and technical language

	## Usage

	```python
	import torch
	from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

	device = "cuda:0" if torch.cuda.is_available() else "cpu"
	torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

	model_id = "Svetozar1993/MultilingualSTT"

	model = AutoModelForSpeechSeq2Seq.from_pretrained(
	    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
	)
	model.to(device)

	processor = AutoProcessor.from_pretrained(model_id)

	pipe = pipeline(
	    "automatic-speech-recognition",
	    model=model,
	    tokenizer=processor.tokenizer,
	    feature_extractor=processor.feature_extractor,
	    torch_dtype=torch_dtype,
	    device=device,
	)

	result = pipe("audio.mp3")
	print(result["text"])
	```

	## Advanced Usage

	### Specify Language
	```python
	# Force French instead of relying on automatic language detection
	result = pipe("audio.mp3", generate_kwargs={"language": "french"})
	```

	### Speech Translation (to English)
	```python
	# Translate the spoken audio into English text
	result = pipe("audio.mp3", generate_kwargs={"task": "translate"})
	```
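The language and task options can be combined in a single `generate_kwargs` dictionary, for example to translate French audio into English. The sketch below shows one way to build that dictionary; `build_generate_kwargs` and `VALID_TASKS` are hypothetical names introduced for illustration and are not part of Transformers.

```python
# Hypothetical helper (not part of Transformers): builds the generate_kwargs
# dictionary passed to the pipeline, validating the task name first.
VALID_TASKS = {"transcribe", "translate"}


def build_generate_kwargs(language=None, task="transcribe"):
    """Return a generate_kwargs dict for the ASR pipeline."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {sorted(VALID_TASKS)}, got {task!r}")
    kwargs = {"task": task}
    if language is not None:
        # Normalise the name, e.g. " French " -> "french"
        kwargs["language"] = language.strip().lower()
    return kwargs
```

With the `pipe` object from the Usage section, this could be used as `pipe("audio.mp3", generate_kwargs=build_generate_kwargs("french", "translate"))`.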

	### Get Timestamps
	```python
	result = pipe("audio.mp3", return_timestamps=True)
	# Each chunk is a dict with "text" and a (start, end) "timestamp" in seconds
	print(result["chunks"])
	```
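As an illustration of what the timestamped chunks can be used for, here is a minimal sketch that turns them into SubRip (SRT) subtitle text. It assumes each chunk is a dict with a `"text"` string and a `"timestamp"` pair of seconds, and that the final end time may be `None`; `format_srt_time` and `chunks_to_srt` are hypothetical helper names.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks) -> str:
    """Convert pipeline-style chunks into SRT subtitle text."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: a missing end time is replaced by the start time
            end = start
        text = chunk["text"].strip()
        blocks.append(
            f"{index}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

Calling `chunks_to_srt(result["chunks"])` on a timestamped result would give a string that can be written to a `.srt` file.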

	### Word-Level Timestamps
	```python
	result = pipe("audio.mp3", return_timestamps="word")
	print(result["chunks"])  # one chunk per word
	```
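Word-level chunks are often regrouped into phrases before display. The following sketch splits on pauses between words; it assumes each word chunk has a `"text"` string (typically with a leading space) and a `"timestamp"` pair of seconds. The function name `group_words_by_pause` and the 0.6 second default are illustrative choices, not values taken from the model.

```python
def group_words_by_pause(word_chunks, max_pause=0.6):
    """Group word-level chunks into phrases, starting a new phrase
    whenever the silence before a word exceeds max_pause seconds."""
    phrases = []
    for chunk in word_chunks:
        start, end = chunk["timestamp"]
        last = phrases[-1] if phrases else None
        starts_new = last is None or (
            start is not None
            and last["end"] is not None
            and start - last["end"] > max_pause
        )
        if starts_new:
            phrases.append({"text": chunk["text"].strip(), "start": start, "end": end})
        else:
            # Word texts are assumed to carry their own leading space
            last["text"] += chunk["text"]
            if end is not None:
                last["end"] = end
    return phrases
```

For example, `group_words_by_pause(result["chunks"])` would return a list of dicts with `"text"`, `"start"`, and `"end"` keys, one per phrase.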

	## Flash Attention 2

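	For faster inference on a compatible GPU, install Flash Attention 2 and enable it when loading the model (it requires a half-precision dtype such as `torch.float16`):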
	```bash
	pip install flash-attn --no-build-isolation
	```

	```python
	model = AutoModelForSpeechSeq2Seq.from_pretrained(
	    model_id,
	    torch_dtype=torch_dtype,
	    low_cpu_mem_usage=True,
	    attn_implementation="flash_attention_2",
	)
	```
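Flash Attention 2 needs a CUDA GPU and an installed `flash-attn` package, so loading can fail on machines without them. A minimal sketch of a fallback follows: it picks `"flash_attention_2"` only when both are present and otherwise returns `"sdpa"`, the PyTorch scaled dot-product attention option in Transformers. `pick_attn_implementation` is a hypothetical helper name.

```python
import importlib.util


def pick_attn_implementation(cuda_available: bool) -> str:
    """Choose a value for the attn_implementation argument."""
    # find_spec returns None when the flash_attn package is not installed
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return "flash_attention_2"
    # Assumption: fall back to PyTorch's built-in SDPA attention
    return "sdpa"
```

The result can be passed when loading the model, for example `attn_implementation=pick_attn_implementation(torch.cuda.is_available())`.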

	## Model Details

	- Architecture: Whisper Large V3
	- Parameters: 1.55B
	- Languages: 99+
	- License: Apache 2.0

	## Author

	Svetozar1993