---
license: apache-2.0
datasets:
- KBLab/rixvox
language:
- sv
---

# Whisper Tiny RixVox Swedish

This is a [Whisper tiny](https://huggingface.co/openai/whisper-tiny) model fine-tuned for Swedish
on the [RixVox](https://huggingface.co/datasets/KBLab/rixvox) dataset.
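
A minimal inference sketch (the model path and audio file are placeholders, and the generation
settings are assumptions rather than part of this card):

```
from transformers import pipeline

# "<path-to-this-model>" and "audio.wav" are placeholders for illustration.
asr = pipeline(
    "automatic-speech-recognition",
    model="<path-to-this-model>",
    chunk_length_s=30,  # process long recordings in 30 s windows
)

print(asr("audio.wav")["text"])
```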

Please note that this model, like every other encoder-decoder speech-to-text model, is prone to
hallucinating on unexpected inputs and to treating the task as translation rather than
transcription, i.e. your mileage may vary depending on the type of data and how it is filtered.
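
If you want to pin the decoder to Swedish transcription (rather than letting it drift into
translation), one option is to pass forced decoder ids at generation time. This is a generic
Whisper/Transformers sketch, not the exact setup used here; the model path is a placeholder:

```
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("<path-to-this-model>")
model = WhisperForConditionalGeneration.from_pretrained("<path-to-this-model>")

# Dummy one-second silent clip at 16 kHz, just to keep the example self-contained.
audio = np.zeros(16000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Force Swedish + transcription so the decoder does not switch task or language.
forced_ids = processor.get_decoder_prompt_ids(language="swedish", task="transcribe")
predicted_ids = model.generate(inputs.input_features, forced_decoder_ids=forced_ids)

print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```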

In this release the entire encoder was frozen. Subsequent releases will unfreeze the encoder
**if** generalization to other types of data (i.e. not parliamentary speeches) is preserved when
doing so.
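
For reference, freezing the encoder in a Transformers fine-tuning script amounts to disabling
gradients on its parameters. This is a generic sketch, not the actual training code:

```
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Disable gradient updates for every encoder parameter; only the decoder is trained.
for param in model.model.encoder.parameters():
    param.requires_grad = False
```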

## Evaluation

* Fleurs WER: 51.68
* Fleurs WER (normalized*): 48.09

*) Normalization is done by applying the following to both the reference and the generated texts:

```
from re import sub

def normalize(s):
    # Lowercase, map every character that is not a digit, a letter (incl. å/ä/ö)
    # or a space to a space, then collapse repeated whitespace.
    return ' '.join(sub('[^0-9a-zåäöA-ZÅÄÖ ]', ' ', s.lower()).split())
```
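
As an illustration of how a normalized WER could be computed with the `normalize` function above
(the decoding setup behind the reported numbers is not documented here; `references` and
`predictions` are assumed to be parallel lists of FLEURS reference transcripts and model outputs):

```
import evaluate

wer_metric = evaluate.load("wer")

def normalized_wer(references, predictions):
    # Apply the same normalization to both sides before scoring.
    return wer_metric.compute(
        references=[normalize(r) for r in references],
        predictions=[normalize(p) for p in predictions],
    )
```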

## Training

Training was done using Hugging Face and DeepSpeed with ZeRO stage 2, with the hyperparameters
listed below; an illustrative configuration sketch follows the list.

* learning rate: 1e-5
* optimizer: CPUAdamW (DeepSpeed)
* lr scheduler: linear
* warmup steps: 500
* per device batch size: 32
* GPUs: 8 x NVIDIA A100 40GB
* total batch size: 160
* steps: 10000
* lowercase: no
* fp16
* entire encoder was frozen
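
The sketch below reconstructs that setup as Transformers `Seq2SeqTrainingArguments`; the output
directory, the DeepSpeed config path and anything not listed above are assumptions, and the CPU
AdamW optimizer would be configured in the DeepSpeed JSON rather than in these arguments:

```
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-tiny-rixvox-sv",  # assumed
    per_device_train_batch_size=32,
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=10000,
    fp16=True,
    deepspeed="ds_config_zero2.json",  # assumed path to a ZeRO stage 2 + CPU Adam config
    predict_with_generate=True,
)
```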