---
license: mit
language: et
tags:
- audio
- automatic-speech-recognition
pipeline_tag: automatic-speech-recognition
base_model:
- openai/whisper-large-v3
library_name: transformers
---

## Introduction

This model is OpenAI Whisper large-v3, finetuned on ~770 hours of manually created subtitles from Estonian TV (ETV). Therefore, this model does not always create verbatim (word-by-word) subtitles but often rephrases sentences and compresses the text, especially in the case of spontaneous speech, hesitations, repetitions, etc. However, the length of the generated text chunks almost always conforms to the ETV subtitle requirements (48 characters per line).

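As an illustration of this constraint, the sketch below (a hypothetical helper, not part of the model or its own post-processing) wraps a transcribed chunk into subtitle lines of at most 48 characters:

```python
import textwrap

MAX_LINE_CHARS = 48  # ETV subtitle requirement mentioned above

def to_subtitle_lines(chunk_text: str, max_chars: int = MAX_LINE_CHARS) -> list[str]:
    """Wrap a transcribed chunk into subtitle lines of at most `max_chars` characters."""
    return textwrap.wrap(chunk_text.strip(), width=max_chars)

# "This is an example of a longer subtitle that must be broken into several lines."
print(to_subtitle_lines("See on näide pikemast subtiitrist, mis tuleb murda mitmeks reaks."))
```
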
## Usage

It is a finetuned version of Whisper large-v3 and can therefore be used via Hugging Face 🤗 Transformers. To run the model, first install the Transformers library. For this example, we'll also install 🤗 Accelerate to reduce the model loading time:

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate
```

The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class to transcribe audio files of arbitrary length:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Use GPU with float16 if available, otherwise fall back to CPU with float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "TalTechNLP/whisper-large-v3-et-subs"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Path to the audio file to transcribe
audio = "sample.mp3"

result = pipe(audio, generate_kwargs={"task": "transcribe", "language": "et"})
print(result["text"])
```

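Since the model is intended for subtitling, it is often useful to also request chunk-level timestamps and write the result out as an SRT file. The sketch below is an illustrative continuation of the example above (the helper `format_srt_time` and the output file name are ours, not part of the model card):

```python
def format_srt_time(seconds: float) -> str:
    # Convert seconds to the SRT timestamp format HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

# Ask the pipeline for chunk-level timestamps in addition to the text
result = pipe(audio, return_timestamps=True,
              generate_kwargs={"task": "transcribe", "language": "et"})

with open("sample.srt", "w", encoding="utf-8") as f:
    for i, chunk in enumerate(result["chunks"], start=1):
        start, end = chunk["timestamp"]
        f.write(f"{i}\n")
        f.write(f"{format_srt_time(start)} --> {format_srt_time(end if end is not None else start)}\n")
        f.write(chunk["text"].strip() + "\n\n")
```
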
## Citation

```
@inproceedings{fedorchenko-2025-optimizing,
    title = "Optimizing Estonian {TV} Subtitles with Semi-supervised Learning and {LLMs}",
    author = {Fedorchenko, Artem and Alum{\"a}e, Tanel},
    booktitle = "Proceedings of the 25th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    year = "2025"
}
```